1 Introduction

Autonomous driving systems are often equipped with different types of sensors to handle various complex driving environments and to improve robustness and accuracy. With the rapid development of sensor technology, an increasing number of sensor fusion algorithms have emerged, and among them 3D object detection has developed the most rapidly [1,2,3,4,5]. Most current fusion algorithms focus on 3D object detection from a camera and lidar and have achieved excellent results, such as multi-view 3D object detection (MV3D) [1] and Frustum PointNets [3]. Fusion algorithms can fully exploit and combine the advantages of each sensor, particularly in complex environments. A monocular camera provides rich RGB information that can be processed and analyzed quickly, so detection results are obtained rapidly; however, depth information and the shape contour of an object are difficult to recover. Lidar has very high accuracy at short range and copes well with complex environments containing many vehicles, pedestrians, and buildings. However, its effective range is within about 100 m, and at long range its point cloud becomes too sparse for reliable detection. The combination of these two sensors is sufficient for most normal environments, but neither the camera nor the lidar has long-range detection capability, and both suffer strong interference in bad weather. Radar detects objects by emitting millimeter waves and therefore works in all weather conditions, day and night. Even in bad weather, radar retains good detection capability, and its detection range is two to three times that of lidar. Furthermore, the radar point cloud is sparse and thus faster to process than the lidar point cloud. Therefore, in this study radar is fused with a camera and lidar for 3D object detection.

Generally, object rotation information is crucial for driving scenarios, but conventional anchor-based detection algorithms often cannot recover it. Moreover, common point cloud models do not account for the rotation and scale invariance of point clouds. In PointNet [6], a T-Net is used to learn rotation features of point clouds; however, without data augmentation, the performance of the model is still greatly degraded by simply rotating an object. Meanwhile, conventional point cloud networks [7,8,9] built on common coordinate systems are limited by the disorder of point clouds, so the rotation characteristics of point clouds are difficult to capture. In recent years, center-based methods have been widely used in object detection for their ability to adapt well to rotated objects without complex post-processing. In multi-sensor fusion methods, 2D image detection results are often projected into a 3D point cloud to form a 3D region of interest (ROI). Another approach uses the ROI generated from a 2D image to limit the search space in the point cloud, which can significantly reduce the amount of computation. This study also adopts this technique: a frustum is generated from 2D detection boxes and combined with depth information to determine the ROI in the point clouds.

Multi-modality fusion for autonomous driving includes pixel-, feature-, and decision-level fusion. Most studies currently focus on pixel-level fusion and have achieved remarkable, application-level results. However, object detection with pixel-level fusion has poor real-time performance and is limited by the differing characteristics of the fused sensors. Meanwhile, a large amount of information is lost in decision-level fusion, and its recognition ability is poor. In feature-level fusion, object features are extracted from the information of each sensor, and the fused features are used to detect objects. Feature-level fusion not only retains sufficient valid information about the object while removing redundancy but also improves detection accuracy. Therefore, a feature-level fusion method is proposed that segmentally fuses features from the three types of sensors. At close range, the ROI spaces in the lidar and radar point clouds are determined by a frustum association method using the detection boxes from a monocular camera. For detection in the lidar point cloud, a spherical voxelization method is proposed based on the core concept of a center-based method. A rotation-invariant feature is extracted by spherical voxel convolution and trilinear interpolation, and the rotation direction of the object is then determined. Moreover, a neural network with dynamic parameter adaptation is used to perform feature-level late fusion, where the fused feature improves the detection results and supplements object attributes. At long range, a monocular camera and radar are used to predict the position and direction of pedestrians so that the autonomous driving system can perform path planning and provide advance warning.

The following are the innovations of this paper:

  • Based on the core concept of a center-based method, the center point of an object is detected and an object model is constructed. The irregular point cloud is then spherically voxelized with the center point as the center of the sphere. Spherical voxel convolution and trilinear interpolation are used to extract rotation-invariant point features, from which the rotation information of the object is obtained.

  • Based on the different characteristics of the three types of sensors, a segmental distributed feature-level fusion scheme is adopted. A frustum association method maps the lidar and radar point clouds to the ROI frustum generated by the visual detection. Furthermore, according to depth, scale, and other information, the scope of the ROI is further reduced to remove all irrelevant points outside it. Thus, detection speed and accuracy are improved.

  • A neural network with dynamic parameter adaptation is proposed for feature fusion, which solves the divergence problem of fusion networks and improves the operational efficiency of the proposed algorithm. The optimized fused feature improves the detection results and the robustness of the proposed algorithm.

2 Related work

Generally, 3D object detection algorithms for multi-modal fusion can be divided into two categories: early and late fusion. Owing to the continuous development of fusion algorithms, not all algorithms can be fully covered by this classification scheme [10, 11]. Therefore, multi-modal fusion methods for object detection are instead divided here into two categories: sequential and parallel fusion.

2.1 Methods based on sequential fusion

Sequential fusion means that each latter stage relies on the processing results of the previous stage, and multi-level features are used in sequence. Qi et al. proposed Frustum PointNets [3] for 3D object detection. The ROI was first extracted by a 2D detector, and the 2D coordinates were then transformed into a 3D space to obtain region proposals from a frustum. The frustum was segmented into blocks to obtain the points of interest for further regression. The method achieved good results by restricting the 3D search space with a well-established 2D detection method. F-ConvNet [12] was proposed based on this cascaded scheme, where a frustum sequence between near and far planes was generated from 2D region proposals. The point-by-point features in the point cloud were converted into frustum-level feature vectors, which improved the running efficiency of the algorithm. CenterFusion [13], based on camera and radar fusion, was also built on the concept of a frustum. Preliminary detection results obtained from radar data and images were associated, and the 3D bounding boxes of objects and their 3D properties (e.g., depth, velocity, and rotation) were estimated by combining them with image features. Since radar was used for attribute regression, this method sacrificed, to some extent, geometric accuracy of the object in exchange for richer object attributes. Tao et al. proposed F-PVNet [14], which made full use of locally sensitive points and contextual features by grouping local points with a frustum and aggregating them with features obtained from submanifold voxel convolution. Tao et al. also proposed a ground-culling algorithm for 3D object detection, which reduced the amount of computation to a certain extent and accelerated the running speed [15]; it is one attempt at removing irrelevant points from the point cloud. Point cloud projection is another sequential fusion approach. Vora et al. proposed PointPainting for 3D object detection [16]: 2D semantic segmentation was first performed by a semantic network, lidar points were then projected into the segmentation mask according to a transformation matrix, and a 3D detector was finally used for classification and localization. Semantic segmentation information thus supplemented lidar detection, and existing networks were continuously enhanced with segmentation scores [17, 18]. However, because the fusion depends on the semantic segmentation itself, it becomes counterproductive when the segmentation accuracy is too low. Pseudo-LiDAR [19], proposed by Wang et al., is another attempt at sequential fusion. A pyramid stereo matching network [20] was first used to estimate depth, each pixel in the image was then back-projected into 3D space to generate a pseudo-lidar signal similar to a lidar point cloud, and an existing lidar detector was finally used for detection. Based on Pseudo-LiDAR, You et al. proposed Pseudo-LiDAR++, which used very few real and accurate lidar points to correct the depth estimation bias. Nakrani et al. attempted to make the Pseudo-LiDAR pipeline end-to-end [21]. Changing the data representation is thus considered to address the problem of poor vision-based depth estimation, providing a valuable idea for image-only perception algorithms.

2.2 Methods based on parallel fusion

Parallel fusion indicates that the fusion stages are carried out simultaneously. Multiple modalities are either first fused into a joint representation and then input to the network, or each modality is processed by its own network before the results are fused. These methods employ a wide variety of integration strategies but lack uniform standards. MV3D [1], proposed by Chen et al., used only the image and the bird's eye view (BEV) of the point cloud, which reduced the amount of computation while retaining enough information. The ROI was first extracted from the BEV and then projected into the image and the front view of the point cloud; after pooling into the same dimensionality, features were extracted and fused. However, this method performs poorly on small objects. Unlike MV3D, Ku et al. proposed AVOD [22], a 3D object detection method that performs fusion before the region proposal stage. Feature maps of the image and the BEV of the point cloud were generated with an FPN [23]; a 3D anchor was used to select corresponding regions in both for fusion, and the fused result was finally fed into fully connected layers for 3D object detection. To avoid the information loss caused by point cloud projection, Xie et al. proposed PI-RCNN [24], which fuses 2D semantic segmentation into 3D region proposals with a new fusion scheme. Pixel-level fusion is also a form of parallel fusion. Liang et al. proposed ContFuse [25], which used continuous convolution to fuse multi-sensor features at multiple scales through pixel-level fusion: ResNet-18 [26] was first used to extract features from the image and the BEV of the point cloud, multi-scale feature fusion was applied to the image, and PCCN [27] was then used to fuse the result into the BEV for 3D object detection. Yoo et al. proposed 3D-CVF [28]: the point cloud was voxelized and transformed into a 2D BEV feature map through sparse convolution, and ResNet-18 [26] was used to extract image features, which were fused into the BEV feature map. This method solved the problem of misaligned views, but the feature blurring [16] inherent in pixel-level fusion inevitably introduced bias. While multi-task problems are often solved with separate networks, MMF [29] used a single network, achieving both point-wise and ROI-wise feature fusion. EPNet [30], proposed by Huang et al., is a lidar-guided image fusion method: a point-by-point correspondence is directly established between the original point cloud and the image, and the importance of the semantic information is estimated to enhance useful features and suppress interfering ones. 4D-Net [31], designed by Piergiovanni et al., combined image, lidar, and temporal information, and motion cues were better exploited through dynamic connection learning. Although the performance of fusion algorithms has improved greatly, a certain gap remains compared with lidar-only algorithms [32, 33]. Additionally, Cao et al. proposed an accelerated point-voxel representation [34], which fuses the features of points and voxels into a single 3D representation. Wang et al. proposed BrT [35], which unifies multimodal data from different sources with a transformer and achieves seamless fusion of point clouds with multi-view images through an aggregated form of point-to-patch projection. Inspired by the application of transformers to 2D vision tasks, Gao et al. proposed LFT-Net [36] to solve the local feature extraction problem in point cloud segmentation by associating local features with point clouds through a local feature transformer network.

Overall, both types of methods have advantages and disadvantages. Sequential fusion resembles a cascade: the processing results of the previous stage directly affect the effectiveness of the latter stage. In parallel fusion there is no direct, strong correlation between stages, but the misalignment of views between sensors is a problem that every algorithm must solve. Therefore, combining the advantages of these two types of methods, a multi-sensor segmental fusion method based on a frustum is proposed, in which sensor fusion is performed in two stages. The first stage is sequential: 2D region proposals in the image, preliminary 3D detection boxes, and center points are obtained by CenterNet [37], and a frustum is generated to determine the ROI in the lidar and radar point clouds, filtering out invalid information in the point cloud. The second stage follows the parallel fusion concept: features are extracted from the three types of sensor data by their respective networks and then fused. The feature fusion reduces the inaccuracy of the initial detection results. The first stage improves the efficiency of feature extraction in the second stage, whereas the second stage reduces the errors caused by the cascade in the first stage.

Fig. 1 Network frame diagram of the proposed method (HM, heat map; Off, offset; WH, width and height; Dim, dimension; Dep, depth; Rot, rotation; Vel, velocity; Att, attributes)

3 Algorithm of multi-sensor segmental fusion of frustum association

In this paper, a segmental fusion association algorithm based on three sensors (camera, lidar, and radar) is proposed for 3D object detection. The detection range of the point cloud is narrowed by a frustum association method, and the lidar point cloud is then spherically voxelized so that a rotation-invariant feature can be extracted by spherical voxel convolution. Meanwhile, a neural network with dynamic parameter adaptation performs feature-level fusion to improve the detection results. Figure 1 shows the framework of the proposed method.

First, a fully convolutional network is used to obtain the 2D detection boxes and center points of objects in the image. At short range, the ROIs in the lidar and radar point clouds, with radar detections expanded into pillars, are determined by the frustum association method. Thereafter, spherical voxelization and spherical voxel convolution are performed on the lidar point cloud to extract the rotation-invariant feature. Finally, feature-level fusion is performed with the object attributes extracted from the image and the radar point cloud to improve the detection results and generate a feature map. At long range, the lidar point cloud is too sparse and is not used.

3.1 Generation of detection boxes and center points

CenterNet is used to generate the detection boxes and center points for frustum association, while object-related properties, such as scale, depth, and 3D position, are regressed. As a representative of the anchor-free family of algorithms, CenterNet models an object as a center point, which avoids some problems of anchor-based methods [38]. The network takes \(I \in R^{W \times H \times 3}\) as the input image, where W and H are the width and height of the image, respectively. A keypoint heatmap is then generated as follows:

$$\begin{aligned} \hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}, \end{aligned}$$
(1)

where R is the output stride and C is the number of object categories. A detected object of class c is output as \(\hat{Y}_{x, y, c}=1\) with center point \((x, y)\), whereas \(\hat{Y}_{x, y, c}=0\) indicates that no object is detected at that location.

Each ground-truth keypoint at position \(r \in R^{2}\) is replaced by the corresponding keypoint \(\tilde{r}=\left[ \frac{r}{R}\right] \) on the downsampled low-resolution map. The ground-truth keypoints are then scattered onto the ground-truth heatmap through a Gaussian kernel:

$$\begin{aligned} Y_{x y c}=\exp \left( -\frac{\left( x-\tilde{r}_{x}\right) ^{2}+\left( y-\tilde{r}_{y}\right) ^{2}}{2 \sigma _{r}^{2}}\right) , \end{aligned}$$
(2)

where \(\sigma _{r}\) is the object-size-adaptive standard deviation and \(Y_{x y c} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}\). If two Gaussians of the same class c overlap, the element-wise maximum is taken.
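As an illustration, Eq. (2) can be sketched in code as follows; the helper name, heatmap size, and σ value are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np

def splat_gaussian(heatmap, center, sigma):
    """Scatter one ground-truth keypoint onto a class heatmap (Eq. 2).

    heatmap : (H/R, W/R) array for a single class c
    center  : downsampled keypoint (x_tilde, y_tilde)
    sigma   : object-size-adaptive standard deviation sigma_r
    """
    h, w = heatmap.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    g = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))
    # Overlapping Gaussians of the same class keep the element-wise maximum
    np.maximum(heatmap, g, out=heatmap)
    return heatmap

# Usage on an assumed 128 x 128 heatmap (e.g., a 512 x 512 image with stride R = 4)
Y = np.zeros((128, 128))
Y = splat_gaussian(Y, center=(40, 64), sigma=2.5)
```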

Object information, including depth, dimension, and orientation, is regressed from the detected center points to generate 3D detection boxes through CenterNet. The depth is computed and output as an additional channel \(\widehat{D} \in [0,1]^{\frac{W}{R} \times \frac{H}{R}}\). The dimension contains three scalars, which are regressed directly to their absolute values via \(\hat{\Gamma } \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times 3}\). The orientation is represented by two bins, each encoded with four scalars. To avoid the discretization error caused by the output stride, a local offset is also computed for each center point.

The training objective function is defined as follows:

$$\begin{aligned} L_{k}=-\frac{1}{N} \sum _{x y c}\left\{ \begin{array}{ll} \left( 1-\hat{Y}_{x y c}\right) ^{\alpha } \log \hat{Y}_{x y c} &{} Y_{x y c}=1 \\ \left( 1-Y_{x y c}\right) ^{\beta }\left( \hat{Y}_{x y c}\right) ^{\alpha } \log \left( 1-\hat{Y}_{x y c}\right) &{} \text {otherwise } \end{array}\right. , \end{aligned}$$
(3)

where N is the number of targets and \(\alpha \) and \(\beta \) are the hyperparameters of focal loss [39].
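A minimal sketch of the penalty-reduced focal loss in Eq. (3) is given below; the values α = 2 and β = 4 follow the common CenterNet setting and are assumptions here, not values reported by this paper.

```python
import torch

def center_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Pixel-wise focal loss of Eq. (3) on heatmaps of shape (B, C, H/R, W/R)."""
    pos = gt.eq(1).float()                                   # Y_xyc = 1
    neg = 1.0 - pos                                          # otherwise
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)                       # N, the number of targets
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```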

Fig. 2 Schematic BEV (left). Schematic diagram of frustum generation (right)

Fig. 3 Actual calibration results

3.2 Frustum association

The precise 2D detection box, a rough 3D detection box, and the center point of each object in a scene can be obtained through CenterNet. To fully use the radar information and reduce irrelevant computation, this paper proposes a result-level frustum association method.

An ROI frustum is created for each object using the 2D detection box obtained from the image together with the estimated depth and size of the object. As shown in Fig. 2, all irrelevant point cloud data outside the view frustum can be filtered out by mapping the frustum to the point cloud, which effectively reduces the computational load of subsequent point cloud detection. Meanwhile, the proposed method addresses the object overlapping problem in 2D image detection: because objects are separated in the 3D point cloud, separate ROI frustums can be created for objects that overlap in 2D, allowing them to be detected more accurately.

Unlike the lidar point cloud, the radar point cloud has an inaccurate Z dimension or no Z dimension at all, resulting in inaccurate height information for the object. Therefore, a preprocessing method that expands the radar point cloud into pillars is proposed. Each radar detection point is expanded into a fixed-size pillar associated with the Z dimension in 3D space. A radar detection is considered to lie within the ROI if the corresponding pillar is located fully or partially inside the ROI frustum.
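The association step can be sketched as follows, assuming a pinhole camera with intrinsic matrix K, radar points already transformed into the camera frame, and an assumed pillar height and depth window; these parameter values are illustrative and not the settings of the proposed method.

```python
import numpy as np

def in_frustum(points_cam, box2d, K, depth_range):
    """Mask of 3D points (N x 3, camera frame) whose projection lies inside the
    2D detection box and whose depth lies inside the estimated depth window."""
    x1, y1, x2, y2 = box2d
    z = points_cam[:, 2]
    uvw = (K @ points_cam.T).T
    u, v = uvw[:, 0] / z, uvw[:, 1] / z
    return ((u >= x1) & (u <= x2) & (v >= y1) & (v <= y2)
            & (z >= depth_range[0]) & (z <= depth_range[1]))

def expand_radar_pillars(radar_cam, height=1.5, n_samples=5):
    """Expand each radar detection (unreliable Z) into a vertical pillar of points
    so that the frustum test can be applied along the missing dimension."""
    offsets = np.linspace(-height / 2, height / 2, n_samples)
    # Camera-frame y axis points downward, so the pillar is sampled along y
    return np.concatenate([radar_cam + np.array([0.0, dy, 0.0]) for dy in offsets])

# A radar detection is kept if any point of its expanded pillar passes the frustum test.
```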

Different types of sensors are not temporally and spatially synchronized when acquiring data. Because of their different acquisition cycles and viewpoints, the camera must be calibrated with the lidar and with the radar before association, and only then can the fusion be performed.

3.2.1 Camera and lidar calibration

In this study, the calibration parameters between the camera and lidar are estimated by aligning features of a calibration plate [40]. The plane normal of the calibration plate is denoted by \(n_L\), the rotation matrix by \(R_C^L\), the camera normal matrix by \(N_C=\left[ n_C^0, n_C^1, n_C^2\right] ^T\), and the lidar normal matrix by \(N_L=\left[ n_L^0, n_L^1, n_L^2\right] ^T\). The centroid of the calibration plate and the plane normal \(n_L\) are first extracted [41]. The camera normal matrix \(N_C\) is then aligned with the lidar normal matrix \(N_L\) through the rotation matrix \(R_C^L\) according to the following equation:

$$\begin{aligned} R_C^L N_C=N_L, \end{aligned}$$
(4)

where \(N_C\) and \(N_L\) are known quantities; hence, \(R_C^L\) can be found.

Although existing methods often include more samples in the computation to improve robustness [42], they also tend to overfit to the calibration plate. Therefore, three sets of board poses are selected to fully constrain Eq. (4), so that \(N_C\) and \(N_L\) form square matrices for the analysis [40].

To solve for \(R_C^L\) in Eq. (4), \(N_C^{-1}\) must be calculated; the linear correlation within the normal matrices therefore strongly affects the accuracy of the calibration results. Moreover, lidar is subject to measurement errors. Matrix condition numbers are used to evaluate the linear correlation and the quality of the rotation parameters, while errors in the calibration plate measurements are used to evaluate the translation parameters. The variability of quality (VOQ) is defined as

$$\begin{aligned} V O Q=\kappa _{L C}+e_{b e}, \end{aligned}$$
(5)

where \(\kappa _{L C}\) denotes the most unstable (largest) of the condition numbers of \(N_L\) and \(N_C\), and \(e_{b e}\) denotes the average board error over the three poses. Thus, a lower VOQ score indicates better alignment. Figure 3 shows the actual calibration result.
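Equations (4) and (5) can be sketched as follows. Stacking the three plane normals into 3 × 3 matrices and re-orthogonalizing the result via SVD are implementation assumptions added for numerical robustness, and reading \(\kappa_{LC}\) as the worse of the two condition numbers is one plausible interpretation of the text.

```python
import numpy as np

def estimate_rotation(N_C, N_L):
    """Solve R_C^L from R_C^L @ N_C = N_L for 3 x 3 normal matrices (Eq. 4)."""
    R = N_L @ np.linalg.inv(N_C)
    U, _, Vt = np.linalg.svd(R)          # project back onto SO(3) (added safeguard)
    return U @ Vt

def voq(N_C, N_L, board_errors):
    """Variability of quality (Eq. 5): condition-number term plus mean board error."""
    kappa_lc = max(np.linalg.cond(N_C), np.linalg.cond(N_L))
    return kappa_lc + float(np.mean(board_errors))
```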

Fig. 4 Transformation diagram of the coordinate system

3.2.2 Camera and radar calibration

The calibration of the camera and radar is relatively easy. Since the radar provides only the \(\textrm{X}\)- and \(\textrm{Y}\)-axis coordinates of an object, the conversion between the two coordinate systems is a transformation within the 2D \(\textrm{X}-\textrm{Y}\) plane.

The camera coordinate system is defined as \(O_C=\left[ x_C, y_C, 1\right] ^T\), the radar coordinate system as \(O_R=\left[ x_R, y_R, 1\right] ^T\), and the rotation angle as \(\theta \). Then, the conversion relationship of the coordinate system is expressed as follows:

$$\begin{aligned} \left[ \begin{array}{c} x_C \\ y_C \\ 1 \end{array}\right] =\left[ \begin{array}{ccc} \cos \theta &{} -\sin \theta &{} x_t \\ \sin \theta &{} \cos \theta &{} y_t \\ 0 &{} 0 &{} 1 \end{array}\right] \times \left[ \begin{array}{c} x_R \\ y_R \\ 1 \end{array}\right] , \end{aligned}$$
(6)

where \(x_t\) and \(y_t\) are the translations in the \(\textrm{X}\)- and \(\textrm{Y}\)-axis directions (see Fig. 4).
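Equation (6) amounts to a planar rigid transform; a direct sketch for a batch of radar points is given below, where θ, \(x_t\), and \(y_t\) are the calibration parameters.

```python
import numpy as np

def radar_to_camera_xy(points_xy, theta, x_t, y_t):
    """Apply the 2D transform of Eq. (6) to an N x 2 array of radar (x_R, y_R) points."""
    T = np.array([[np.cos(theta), -np.sin(theta), x_t],
                  [np.sin(theta),  np.cos(theta), y_t],
                  [0.0,            0.0,           1.0]])
    homo = np.hstack([points_xy, np.ones((len(points_xy), 1))])   # homogeneous coordinates
    return (T @ homo.T).T[:, :2]
```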

3.3 Spherical voxelization of the lidar point cloud

Vehicles in driving scenes often have a certain rotation relative to their driving direction, and under the influence of multiple factors (e.g., road slope and curves) this rotation can be arbitrary. Obtaining the object orientation is also a key factor in predicting the movement of objects and preventing collisions.

After the region of interest in the lidar point cloud and its center point are obtained through frustum association, spherical voxelization and spherical voxel convolution [43] are used to classify objects, and the rotation-invariant feature and object rotation information are extracted. To convert the point cloud into spherical voxels with a Euclidean structure, a density-aware adaptive sampling (DAAS) method is proposed instead of uniform sampling to address the problem of sparse points around the poles and dense points around the equator (Fig. 5). This non-uniform distribution would otherwise bias the spherical signal, so the features extracted by the spherical voxel convolution could not be aligned with the original point cloud features. The DAAS method samples the point cloud near the poles with a wider filter to compensate for the density difference; the exact formulation is given in Eq. (7).

Fig. 5 Spherical voxelization of the lidar point cloud. The closer to the poles, the wider the filter, a technique to adjust for the differences in sampling density

Fig. 6 Spherical voxel convolution

A unit sphere is defined as the set of points \(x \in \mathbb {R}^{3}\) with unit norm. The spherical voxel space is defined as \(S^{2} \times H\). Points in this space are described by \((\alpha , \beta , h)\), where \((\alpha , \beta ) \in S^{2}\), \(\alpha \) and \(\beta \) are the polar and azimuth angles, respectively, and h is the straight-line distance from the point to the center of the sphere. The position of a spherical voxel is determined by its center \(\left( a_{i}, b_{j}, c_{k}\right) \), where \((i, j, k) \in I \times J \times K\) and \(I \times J \times K\) is the spatial resolution, called the bandwidth. The coordinates of the nth of N points in \(S^{2} \times H\) are \(\left( \alpha _{n}, \beta _{n}, h_{n}\right) \). The spherical signal \(f: S^{2} \times H \rightarrow \mathbb {R}\) is computed as

$$\begin{aligned} f\left( a_{i}, b_{j}, c_{k}\right) =\frac{\sum _{n=1}^{N} \omega _{n} \cdot \left( \delta -\left\| h_{n}-c_{k}\right\| \right) }{\sum _{n=1}^{N} \omega _{n}}, \end{aligned}$$
(7)

where \(\omega _{n}\) is the normalization factor, which is defined as

$$\begin{aligned} \omega _{n}\!=\!1\left( \left\| \alpha _{n}\!-\!a_{i}\right\| \!<\!\delta \right) \cdot 1\left( \left\| \beta _{n}-b_{j}\right\| \!<\!\eta \delta \right) \cdot 1\left( \left\| h_{n}\!-\!c_{k}\right\| \!<\!\delta \!\right) \!, \end{aligned}$$
(8)

where \(\eta \) is the density-aware sampling factor and \(\delta \) is the pre-defined filter width; setting \(\eta =\sin (\beta )\) allows f to adaptively sample the point set under non-uniform density. The term in Eq. (9) expresses information along the H-axis orthogonal to \(S^{2}\), which remains unchanged under arbitrary rotations.

$$\begin{aligned} \left( \delta -\left\| h_{n}-c_{k}\right\| \right) \in [0, \delta ]. \end{aligned}$$
(9)
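A direct, unvectorized sketch of Eqs. (7)-(9) is given below; the voxel-center grid, the filter width δ, and the choice of evaluating η = sin(β) at the voxel center are assumptions made for illustration.

```python
import numpy as np

def spherical_voxel_signal(points_abh, centers_abh, delta):
    """Spherical signal f of Eq. (7) with the density-aware weights of Eq. (8).

    points_abh  : (N, 3) point coordinates (alpha_n, beta_n, h_n) in S^2 x H
    centers_abh : (V, 3) voxel centers (a_i, b_j, c_k)
    delta       : pre-defined filter width
    """
    f = np.zeros(len(centers_abh))
    for v, (a, b, c) in enumerate(centers_abh):
        eta = max(abs(np.sin(b)), 1e-3)          # density-aware factor, clipped for stability
        w = ((np.abs(points_abh[:, 0] - a) < delta)
             & (np.abs(points_abh[:, 1] - b) < eta * delta)
             & (np.abs(points_abh[:, 2] - c) < delta)).astype(float)
        if w.sum() > 0:
            f[v] = np.sum(w * (delta - np.abs(points_abh[:, 2] - c))) / w.sum()
    return f
```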

3.4 Spherical voxel convolution

The spherical signal on \(S^{2} \times H\) serves as the input of the spherical voxel convolution. The rotation group SO(3) is a special orthogonal group whose elements can be parameterized by ZYZ-Euler angles \((\alpha , \beta , \gamma )\), where \(\alpha \in [0,2 \pi ]\), \(\beta \in [0, \pi ]\), and \(\gamma \in [0,2 \pi ]\). To conveniently extract the rotation-invariant feature, the rotation operator \(L_{R}\) of the spherical voxel signal is defined as

$$\begin{aligned} \left[ L_{R} f\right] (s)=f\left( R^{-1} s\right) , \end{aligned}$$
(10)

where \(R \in S O(3), s \in S^{2} \times H, f: S^{2} \times H \rightarrow \mathbb {R}\). The rotation only affects the spherical coordinate \(S^{2}\), which has no effect on the H domain (see Fig. 6).

The convolution of two spherical signals is computed as

$$\begin{aligned} {[\psi * f](p) }= & {} \left\langle L_{P} \psi , f\right\rangle \nonumber \\= & {} \int _{S^{2} \times H} \psi \left( P^{-1} s\right) f(s) d s, \end{aligned}$$
(11)

where \(\psi \) represents the filter, \(f: S^{2} \times H \rightarrow \mathbb {R}\), \(p \in S^{2} \times H\), and P represents the element of SO(3) corresponding to p.

To prove rotation invariance, assume the point cloud is rotated by an arbitrary rotation matrix R; then every \(p \in S^2 \times H\) is mapped to \(Rp\). Since f is sampled from the input point cloud, the rotation of the spherical signal is represented by \(f \rightarrow L_R f\). The spherical voxel convolution is then applied to the rotated input signal:

$$\begin{aligned} {\left[ \psi * L_R f\right] } (R p)= & {} \left\langle L_{R P} \psi , L_R f\right\rangle \nonumber \\= & {} \int _{S^2 \times H} \psi \left( P^{-1} R^{-1} s\right) f\left( R^{-1} s\right) d s \nonumber \\= & {} \int _{S^2 \times H} \psi \left( P^{-1} s\right) f(s) d s \nonumber \\= & {} [\psi * f](p), \end{aligned}$$
(12)

where the simplification from the second line to the third follows [43]. Therefore, the output of the spherical voxel convolution is not affected by rotation. The rotation information of the object can then be obtained by solving the transformation between the corresponding features (e.g., those of the tires and the front of the car) and their counterparts in the standard orientation of the world coordinate system.

The spherical voxel convolution first converts the input and the filter to the frequency domain via the fast Fourier transform (FFT). They are then multiplied and converted back to the spatial domain via the inverse FFT (IFFT) [44]. Finally, point resampling is performed; that is, the features are resampled at the original point positions. Trilinear interpolation is used as the operator \(\Lambda : \mathbb {R}^{S^{2} \times H \times C} \rightarrow \mathbb {R}^{N \times C}\): the feature of each point is the weighted average of its eight nearest voxels, with weights inversely proportional to the distance between the point and each spherical voxel. A point-wise feature is then obtained through fully connected layers.
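The point resampling step can be sketched as follows, following the description above (a weighted average over the eight nearest voxels with inverse-distance weights); voxel centers are assumed to be given in the same Cartesian frame as the query points.

```python
import numpy as np

def resample_at_points(voxel_centers, voxel_feats, query_points, eps=1e-8):
    """Operator Lambda: resample voxel features (V, C) at the N original point positions."""
    out = np.zeros((len(query_points), voxel_feats.shape[1]))
    for n, p in enumerate(query_points):
        d = np.linalg.norm(voxel_centers - p, axis=1)
        idx = np.argsort(d)[:8]                 # eight nearest spherical voxels
        w = 1.0 / (d[idx] + eps)                # weights inversely proportional to distance
        out[n] = (w[:, None] * voxel_feats[idx]).sum(axis=0) / w.sum()
    return out
```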

3.5 Feature extraction and feature-level fusion

In practical applications, each sensor can independently perform object detection and attribute regression, but its advantages often cannot be fully exploited. In this study, three different sensors were used to extract object features. Image features were extracted through a fully convolutional network to predict the object center and generate a bounding box; the extracted features include the 2D size, 3D size, depth, rotation, and center offset of the object, which serve as the primary regression. Rotation-invariant features were extracted from the lidar point cloud through the spherical voxel convolution and used to predict the true 3D rotation direction of an object, from which a detection box fitting the rotation direction was generated. Radar detection directly provides the depth of the object, and the object's speed can be extracted from the Doppler effect. The feature values in the feature vectors extracted by the three sensors through their networks have constraining and complementary relations. According to these relations, a neural network with dynamic parameter adaptation was proposed to fuse the feature vectors.

A distributed feature-level late fusion method was adopted in this study. As shown in Fig. 7, features were extracted from the three types of sensor data, and the neural network with dynamic parameter adaptation was used for feature fusion.

Fig. 7 Schematic diagram of the feature-level late fusion

The network comprises an input layer, a hidden layer with Gaussian neurons, and an output layer with linear neurons. M and Q denote the numbers of neurons in the hidden and output layers, respectively. The input vector is \(X=\left[ x_{1}, x_{2}, \ldots , x_{R}\right] ^{T}\), and the output is \(Y=\left[ y_{1}, y_{2}, \ldots , y_{Q}\right] ^{T}\). The output of the jth hidden unit is expressed as

$$\begin{aligned} Z_{j}=\exp \left( -\left\| \frac{X-C_{j}}{\sigma _{j}}\right\| \right) , \end{aligned}$$
(13)

where \(Z_{j}\) is the output value of the jth neuron in the hidden layer, \(j=1,2, \ldots , M\); \(C_{j}=\left[ C_{j 1}, C_{j 2}, \ldots , C_{j R}\right] ^{T}\) is the center vector of the jth hidden neuron, with one center component per input dimension; and \(\sigma _{j}\) is the width of the jth hidden neuron corresponding to \(C_{j}\).

The following expression presents the relationship between the input and output of the neurons in the output layer:

$$\begin{aligned} y_{k}=\sum _{j=1}^{M} w_{k j} Z_{j}, \end{aligned}$$
(14)

where \(y_{k}\) is the output value of the kth neuron in the output layer, \(k=1,2, \ldots , Q\), and \(w_{k j}\) is the weight between the kth neuron in the output layer and the jth neuron in the hidden layer. Given enough neurons in the hidden layer, the network can approximate any function with any desired accuracy. Parameters such as the centers and widths of the hidden neurons and the output-layer weights determine the network performance; if they cannot be determined accurately, the network may diverge [45].
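A minimal forward pass of the fusion network described by Eqs. (13)-(14) is sketched below; the layer sizes and random initialization are placeholders, since the actual parameters are obtained by the optimization described next.

```python
import numpy as np

class RBFFusionNet:
    """Gaussian hidden layer (Eq. 13) followed by a linear output layer (Eq. 14)."""

    def __init__(self, R, M, Q, seed=0):
        rng = np.random.default_rng(seed)
        self.C = rng.normal(size=(M, R))     # centers C_j
        self.sigma = np.ones(M)              # widths sigma_j
        self.W = rng.normal(size=(Q, M))     # output weights w_kj

    def forward(self, x):
        z = np.exp(-np.linalg.norm((x - self.C) / self.sigma[:, None], axis=1))  # Eq. (13)
        return self.W @ z                                                        # Eq. (14)

# Usage: map a fused feature vector of length R = 64 to Q = 8 regression outputs
net = RBFFusionNet(R=64, M=32, Q=8)
y = net.forward(np.zeros(64))
```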

In this paper, a dynamic adaptive particle swarm optimization is proposed to optimize three parameter sets: the center, base width, and weight vectors. The algorithm is first initialized with a group of random particles, and the optimal solution is then obtained through iteration. Each particle updates itself by tracking two extrema, pbest and gbest [46]. After these two optima are obtained, the velocity and position are updated according to Eq. (15):

$$\begin{aligned} \left\{ \begin{array}{c} v_{i}\!=\!\omega v_{i}+c_{1} \!\times \! {\text {rand }}() \!\times \!\left( \text{ pbest}_{i}\!-\!x_{i}\right) \!+\!c_{2} \!\times \! {\text {rand }}() \!\times \!\left( { gbest }_{i}\!-\!x_{i}\right) \\ x_{i}(t+1)=x_{i}(t)+v_{i}(t) \end{array},\right. \end{aligned}$$
(15)

where \(i=1,2, \ldots , N, N\) is the total number of particles, \(\omega \) is the inertia weight, \(v_{i}\) is the particle velocity, rand() is a random number between (0, 1), \(x_{i}\) is the current position of the particle, and \(c_{1}\) and \(c_{2}\) are the learning factors.

To increase the global convergence ability and avoid falling into a local optimum in the early stage, the learning factors and inertia weight are changed dynamically and adaptively in this study. The initial value of learning factor \(c_{1}\) is denoted by \(c_{1 m}\), which is reduced nonlinearly to \(c_{1 m}^{\prime }\) during the iterations; the initial value of \(c_{2}\) is \(c_{2 m}\), which is increased nonlinearly to \(c_{2 m}^{\prime }\); and the initial value of \(\omega \) is \(\omega _{m}\), which is reduced nonlinearly to \(\omega _{m}^{\prime }\). The relevant formulas are:

$$\begin{aligned} \left\{ \begin{array}{l} c_{1}(k)=c_{1 m}^{\prime }+\left( \frac{m-i}{m}\right) ^{\alpha }\left( c_{1 m}-c_{1 m}^{\prime }\right) \\ c_{2}(k)=c_{2 m}^{\prime }+\left( \frac{m-i}{m}\right) ^{\beta }\left( c_{2 m}-c_{2 m}^{\prime }\right) \\ \omega (k)=\omega _{m}^{\prime }+\left( \frac{m-i}{m}\right) ^{\beta }\left( \omega _{m}-\omega _{m}^{\prime }\right) \end{array}\right. , \end{aligned}$$
(16)

where m is the maximum number of iterations, i is the current number of iterations, and \(\alpha , \beta \in \) \(\{0.5,1,1.5,2.0\}\).

Meanwhile, a mutation in the later stage of evolution divides the population into two parts: one part still follows the original update formula, whereas the position update of the other part is changed to Eq. (17). These particles move away from gbest, which increases the diversity of the swarm and helps avoid local optima.

$$\begin{aligned} x_{i}(t+1)=x_{i}(t)-v_{i}(t) . \end{aligned}$$
(17)

The values of the three parameter sets (i.e., the center, base width, and weight vectors) are encoded as the particle parameters, and the fitness of each particle is calculated according to the fitness function defined by the normalized root mean square error (NRMSE):

$$\begin{aligned} f=N R M S E=\sqrt{\frac{\sum _{k=1}^{N}\left( y(k)-y_{m}(k)\right) ^{2}}{N \sum _{k=1}^{N} y^{2}(k)}} . \end{aligned}$$
(18)

Then f is compared with the fitness of \(pbest_{i}\) and of gbest, and the relevant parameters are updated. When the optimization reaches the maximum number of iterations or the desired approximation accuracy, the three optimized parameter sets of the neural network are output.
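A condensed sketch of the dynamic adaptive particle swarm optimization is given below; the swarm size, parameter bounds, the point at which the late-stage mutation of Eq. (17) starts, and the 50/50 population split are illustrative assumptions.

```python
import numpy as np

def nrmse(y_pred, y_true):
    """Fitness function of Eq. (18)."""
    return np.sqrt(np.sum((y_true - y_pred) ** 2) / (len(y_true) * np.sum(y_true ** 2)))

def dapso(fitness, dim, n_particles=30, m=100, c1=(2.5, 0.5), c2=(0.5, 2.5),
          w=(0.9, 0.4), alpha=1.0, beta=1.0, mutate_after=0.8, seed=0):
    """Optimize a particle (encoded centers, widths, and weights) that minimizes `fitness`."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([fitness(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()
    for i in range(m):
        frac = (m - i) / m
        c1_i = c1[1] + frac ** alpha * (c1[0] - c1[1])        # Eq. (16): c1 decreases
        c2_i = c2[1] + frac ** beta * (c2[0] - c2[1])         #           c2 increases
        w_i = w[1] + frac ** beta * (w[0] - w[1])             #           inertia decreases
        v = (w_i * v
             + c1_i * rng.random((n_particles, 1)) * (pbest - x)
             + c2_i * rng.random((n_particles, 1)) * (gbest - x))   # Eq. (15)
        if i > mutate_after * m:                              # late-stage mutation
            half = n_particles // 2
            x[:half] += v[:half]
            x[half:] -= v[half:]                              # Eq. (17)
        else:
            x += v
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved] = x[improved]
        pbest_f[improved] = f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest
```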

Table 1 Comparison of different object detection algorithms on nuScenes dataset
Table 2 Comparison of the per-class performance for 3D object detection on the nuScenes dataset

3.6 Loss function

The following is the definition of the overall multi-task loss function of the proposed algorithm:

$$\begin{aligned} L=\lambda _{r c} L_{r c}+\lambda _{a r} L_{a r}+\lambda _{f o} L_{f o}, \end{aligned}$$
(19)

where \(\lambda \) denotes an adjustment coefficient and the subscripts rc, ar, and fo refer to the bounding box and center point extraction, the attribute regression, and the feature-fusion optimization module, respectively. The adjustment factor \(\lambda \) controls the weight of each task, which strongly affects the model's performance and training efficiency. A dynamic weight average approach was used in this study to adjust the weight of each loss dynamically [47]. Here, \(w_k(\cdot )\) denotes the relative rate of loss decline in task k, i.e., the ratio of the current loss to the previous loss:

$$\begin{aligned} w_k(t-1)=\frac{L_k(t-1)}{L_k(t-2)}, \end{aligned}$$
(20)

where \(L_k(\cdot )\) denotes the loss of task k. The larger the ratio, the harder the task is to train and the larger the weight that must be assigned. The weight \(\lambda _k\) of task k is updated as follows:

$$\begin{aligned} \lambda _k(t)=\frac{K \exp \left( \frac{w_k(t-1)}{T}\right) }{\sum _i \exp \left( \frac{w_k(t-1)}{T}\right) }, \end{aligned}$$
(21)

where \(K=\sum _i \lambda _i(t)\) ensures that the weights remain within a fixed range and T is a temperature that modulates the task distribution: the larger T is, the more uniform the weight distribution becomes. The weights are initialized consistently for each task, and a prior, unbalanced initialization can be introduced according to the actual scenario.
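A small sketch of the dynamic weight average update of Eqs. (20)-(21); the temperature value and the equal-weight fallback for the first iterations are assumptions.

```python
import numpy as np

def dwa_weights(loss_history, T=2.0):
    """Dynamic weight average (Eqs. 20-21).

    loss_history : list (length >= 2) of per-task loss vectors, one entry per
                   iteration, e.g. [[L_rc, L_ar, L_fo], ...]; during the first
                   two iterations equal weights are used instead.
    """
    prev2, prev1 = np.asarray(loss_history[-2]), np.asarray(loss_history[-1])
    w = prev1 / prev2                    # relative rate of loss decline, Eq. (20)
    e = np.exp(w / T)
    return len(w) * e / e.sum()          # Eq. (21): weights sum to K
```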

The loss \(L_{r c}\) of the bounding box and center point extraction module is defined as

$$\begin{aligned} L_{r c}=L_{k}+\gamma _{s} \frac{1}{N} \sum _{k=1}^{N}|\hat{S}_{k}-s_{k}|+\gamma _{o} \frac{1}{N} \sum _{r}|\hat{O}_{\tilde{r}}-\left( \frac{r}{R}-\tilde{r}\right) |, \end{aligned}$$
(22)

where \(\gamma \) is an adjustment factor, N is the number of objects, \(\hat{S}\) is the predicted size, s is the object size, and \(\hat{O}\) is the local offset.

To increase the robustness of the attribute regression and feature-fusion optimization modules, the Huber loss is used uniformly; it is defined as

$$\begin{aligned} L_{*}=\left\{ \begin{array}{ll} \frac{1}{2}(\Delta P)^{2} &{} |\Delta P|\le \xi \\ \xi |\Delta P|-\frac{1}{2} \xi ^{2} &{} |\Delta P|>\xi \end{array}\right. , \end{aligned}$$
(23)

where \(\Delta P\) is the prediction residual and \(\xi \) is a hyperparameter determined during training.
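For reference, the Huber loss of Eq. (23) applied element-wise to the prediction residual can be written as follows (the default ξ is an illustrative value):

```python
import torch

def huber_loss(residual, xi=1.0):
    """Huber loss of Eq. (23); xi is the hyperparameter determined during training."""
    abs_r = residual.abs()
    return torch.where(abs_r <= xi, 0.5 * residual ** 2, xi * abs_r - 0.5 * xi ** 2)
```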

4 Experimental verification and analysis

In this study, the proposed algorithm was evaluated on the nuScenes [48] and Radiate [49] datasets and compared with currently popular object detection algorithms. The robustness of the proposed algorithm under different weather conditions was tested on the Radiate dataset, and ablation experiments were performed on the nuScenes dataset. Finally, to verify the behavior of the proposed algorithm in an actual scene, experiments were conducted on a real autonomous car platform. The proposed algorithm was implemented in the PyTorch framework and run on a computer with Ubuntu 20.04, an i7-9700K CPU, and dual 2080 Ti GPUs.

4.1 Tests on nuScenes dataset

The nuScenes dataset is a large-scale autonomous driving dataset that includes camera, lidar, and radar data. It comprises over 1,000 scenes, with 28,130 training and 6,019 validation samples [48]. The nuScenes detection score (NDS), a weighted sum of the mAP and several error metrics, is generally used as the evaluation metric.

The performance of several 3D object detection algorithms was compared on the nuScenes dataset, including the lidar-based InfoFocus [50] and PointPillars [51], the camera-based MonoDIS [52] and CenterNet [37], the camera-radar-based CenterFusion [13], and the camera-lidar-radar-based DWD-Fusion [53]. In Table 1, "C," "R," and "L" indicate whether a camera, radar, or lidar was used. Several indicators were selected for the evaluation, including NDS, mAP, mATE, mASE, mAOE, mAVE, and mAAE, where mATE, mASE, mAOE, mAVE, and mAAE denote the mean errors in translation, scale, orientation, velocity, and attributes, respectively. The up arrow "\(\uparrow \)" and the down arrow "\(\downarrow \)" indicate that higher is better and lower is better, respectively.

Fig. 8 Recall-Precision curves

Fig. 9 The valid point cloud information retained by the view frustum association. After the frustum association, only the valid point cloud information of the object vehicle and the necessary road contour information were retained in the point cloud BEV below the corresponding image, greatly reducing the amount of point cloud processing

In Table 1, the NDS of the proposed algorithm is higher than that of the other methods; specifically, it is 21.25%, 6.83%, 8.02%, and 5.21% higher than that of CenterNet, PointPillars, CenterFusion, and DWD-Fusion, respectively. This indicator shows the remarkable overall performance of the proposed algorithm. Moreover, the lidar-based InfoFocus not only outperformed the other algorithms in mAP, but its errors in feature and attribute prediction were also significantly lower than those of the other algorithms. Compared with CenterFusion [13], the proposed algorithm incorporates the more accurate lidar, which yields a significant improvement in all performance indicators and demonstrates the benefit of fusing lidar data. Like the proposed algorithm, DWD-Fusion [53] also incorporates the features of three types of sensors; compared with it, the proposed algorithm still achieves higher scores in all performance indicators, with accurate predictions and minimized errors in velocity and orientation.

Table 2 presents the detection accuracy of each algorithm for various objects on the nuScenes dataset. The average accuracy of the proposed algorithm for cars, trucks, motorcycles, and bicycles was better than that of the other algorithms, and it also achieved better detection results for medium-sized objects. Compared with CenterFusion [13] and DWD-Fusion [53], the proposed algorithm achieved higher detection accuracy for all types of targets, indicating its superior detection performance.

Figure 8 shows the precision-recall curves of the different algorithms. The proposed algorithm balanced precision and recall and therefore had better overall performance. Since object detection algorithms often require a large amount of time to process complex point cloud information, a direct way to improve efficiency is to remove invalid information from the point cloud. The proposed frustum association method effectively filtered out almost all invalid information while retaining the valid point cloud information inside the ROI frustum. As shown in Fig. 9, only the valid information of each vehicle on the road was retained after frustum association, greatly reducing the time for subsequent detection and feature extraction; the detection accuracy was also improved to a certain extent.

Figure 10 shows the impact of the presence or absence of frustum association on detection time and accuracy in various scenarios. Frustum association effectively reduced the detection time, with a maximum reduction of 70.95% compared with no frustum association, and it also improved the detection accuracy in most scenarios.

Fig. 10 Impact of the presence or absence of frustum association on the detection time and accuracy of the detector in different scenarios. The bar graphs and curves represent time and average accuracy, respectively

Table 3 Ablation experiments on nuScenes dataset

To verify the effectiveness of each module of the proposed algorithm, ablation experiments were performed on the nuScenes dataset. The proposed algorithm associates the lidar and radar point clouds with objects through frustum association based on CenterNet; features are then extracted from the three types of sensor data and fused by the feature-level fusion network to obtain the final detection result. The ablation experiments were therefore divided into two parts. In the first part, CenterNet was selected as the baseline to examine the effects of the frustum association, spherical voxel convolution, and feature-level fusion network on the detection results.

Table 3 presents the results of the ablation experiments, as well as the impact of each module on the performance metrics. In the table, FA represents frustum association; SVC, spherical voxel convolution; and FFN, feature-level fusion network. The change in percentage data was compared with the benchmarking CenterNet method.

In the first experiment, only the point cloud was associated with objects through the frustum; information such as depth, speed, and size was used directly to supplement the object features without convolutional feature extraction or feature fusion. This simple point cloud processing improved the NDS by 8.6% and the mAP by 2.9% compared with camera-only CenterNet, and various attribute errors were also reduced. In the second experiment, the point cloud was projected directly onto the image plane, and the image feature and the two unprocessed point cloud features were fused through the feature-level fusion network. The NDS and mAP increased by 17.4% and 3.4%, respectively, and unlike in the first experiment, the errors of the various attributes improved greatly. The frustum association and the feature-level fusion network were then combined to further improve performance, and the spherical voxel convolution greatly reduced the orientation error. Finally, the NDS of the complete proposed algorithm was 21.3% higher than that of the baseline, and its mAP was 5.6% higher (see Fig. 11).

Fig. 11 Visualization in 3D maps. The top and bottom images indicate the 3D frame prediction in the images and 3D frame prediction in the point cloud, respectively

Fig. 12 BEV prediction. The top and bottom images indicate the original image and 2D box prediction in the BEV, respectively

For object detection algorithms, more sensors and sensor types are not necessarily better. A large amount of sensor data can sometimes hinder the algorithm's ability to detect objects correctly and can reduce detection speed and efficiency; consequently, the performance of some multimodal algorithms may be lower than that of a single lidar-based algorithm. Comparing the contribution of each sensor in the fusion algorithm is therefore the basis for judging whether the algorithm is reasonable and whether it makes full use of the various data. The influence of the three types of sensor data on the detection accuracy of the proposed algorithm is compared next (see Fig. 12).

The NDS and mAP were tested for four models: camera-only, camera-radar, camera-lidar, and camera-radar-lidar. Figure 13 presents the results, which indicate that the multimodal models outperformed the single-camera model. For the camera-radar-lidar model, the NDS increased by 12.01% and the mAP by 4.39% compared with the camera-radar model. Although radar supplies additional features, it alone cannot improve the average accuracy, whereas the high accuracy of lidar significantly improves the average accuracy of the proposed algorithm. The camera-radar-lidar model adopted in this study therefore improves accuracy and provides richer object attribute features, so the proposed method based on this model best meets the needs of automatic driving systems for object detection.

Fig. 13 Sensor ablation experiments of the proposed algorithm on the nuScenes dataset. "C," "R," and "L" represent the camera, radar, and lidar, respectively

4.2 Tests on Radiate dataset

Radiate is a severe weather dataset released by the Radiate project of Heriot-Watt University in Scotland, comprising 3 h of annotated radar images and 200,000 labeled road actors, including other vehicles and pedestrians, collected especially in common severe weather conditions [49]. Validating algorithms on this dataset helps to examine the safety of autonomous driving in bad weather.

This experiment mainly verified the robustness of the proposed algorithm under different weather conditions. Three state-of-the-art algorithms using different sensors were selected for comparison with the proposed algorithm, and the average accuracy was tested in five weather conditions: day, night, rain, snow, and fog. Table 4 presents the accuracy comparison among the different algorithms.

Table 4 Accuracy comparison among different algorithms on Radiate under different weather conditions

The experimental results presented in Table 4 indicate that the accuracy of the lidar-only InfoFocus was slightly higher than that of the proposed algorithm in normal weather, such as day and night. However, the proposed algorithm had clear advantages in rainy, snowy, and foggy conditions. Compared with CenterFusion, which achieved good results in rainy, snowy, and foggy conditions, the accuracy of the proposed algorithm was 8.83%, 7.02%, and 7.99% higher in these three conditions, respectively. The average accuracy of the proposed algorithm was also 6.73% higher than that of DWD-Fusion, which likewise uses three sensors.

To reflect the performance of the various algorithms more intuitively, the test data are plotted in Fig. 14. Generally, the camera and lidar are easily disturbed in bad weather and cannot provide accurate scene information for object detection, whereas radar adapts well to bad weather. In normal weather, lidar provides high-precision 3D point cloud information compared with the camera and radar. Figure 14 shows that the detection accuracy of InfoFocus was higher than that of the other algorithms during day and night, whereas the multimodal algorithms, limited by their sensor fusion schemes, were slightly less accurate. In rainy, snowy, and foggy conditions, the accuracy of InfoFocus and F-PointNet dropped significantly, while CenterFusion, DWD-Fusion, and the proposed algorithm maintained relatively stable levels. In particular, the proposed feature-level fusion of three types of sensors improved the detection results, so the accuracy of the proposed algorithm was the highest in bad weather conditions.

Fig. 14 Accuracy comparison curves among different algorithms on Radiate under different weather conditions

The model was trained on the experimental platform, and the loss curves for the different weather conditions are plotted in Fig. 15. The batch size was 12, the learning rate was 0.01, and training was run for 100 epochs. The loss curves vary with the weather conditions: the best and second-best behavior occurred in the daytime and rainy scenes, respectively. In general, the loss decreased sharply at the beginning of training and then converged to a stable level. The results show that the trained model has excellent prediction ability.

These experiments show that the proposed algorithm is more robust under different weather conditions and that its detection performance is more stable.

Figures 16 and 17 show the visualization results of the proposed algorithm on the Radiate dataset under normal daytime conditions and under night, rainy, snowy, and foggy conditions, respectively. The results indicate that the proposed algorithm achieves good detection accuracy both in normal daytime weather and in extreme conditions (e.g., night, rain, snow, and fog).

Fig. 15 Loss curves for different weather conditions on the Radiate dataset

Fig. 16 Visualization under normal weather conditions during daytime. The prediction results of the camera, lidar, and radar are shown from top to bottom, respectively

Fig. 17 Visualization effect on night, rainy, snowy and foggy days. a Night scene, b rainy scene, c snowy scene, d foggy scene

4.3 Tests on real test site

To verify the actual operation of the proposed algorithm, road experiments were conducted on a real vehicle platform at the test site of the Suzhou Automotive Research Institute of Tsinghua University, where advanced facilities can simulate various weather conditions. As shown in Fig. 18, the platform mainly comprises three types of sensors: camera, lidar, and radar. The camera provided clear image information, the 64-beam lidar provided rich point cloud information at short range, and the radar provided relatively sparse point cloud information over a wide range while still performing well in bad weather. The proposed algorithm processed these three types of sensor data to detect surrounding vehicles in bad weather.

In the actual experiments, the detection accuracy of the proposed algorithm was verified under different weather conditions and meets the needs of practical applications. The three types of sensor data were collected from the platform, assembled into a dataset, and fed into the proposed model; the resulting detection accuracy is plotted in Fig. 19. The results indicate that the proposed algorithm achieved satisfactory detection accuracy for three types of medium and large vehicles in various weather conditions. The mAP reached a maximum of 0.855 and remained sufficiently high even in snowy and foggy conditions. These experiments show that the proposed algorithm has excellent robustness and strong generalization ability across weather conditions and environments.

Finally, the real scene data collected by the platform were fed into the trained model, and 3D object detection was performed in real time. Figure 20 shows the output feature map of the point cloud. The algorithm accurately identified vehicles and pedestrians in complex scenes, relying on the segmental fusion to make full and reasonable use of the sensor data. Furthermore, the proposed method made rough predictions for distant objects, which further demonstrates its performance and the soundness of its design.

Fig. 18 Real vehicle platform in a real test site

Fig. 19 Detection accuracy of various objects in different weather conditions

Fig. 20 Visualization of the detection results from the real dataset

5 Conclusion

This paper proposed a 3D object detection algorithm for autonomous driving based on multi-sensor segmental fusion with a frustum. The fusion fully exploits the advantages of each sensor and hence improves accuracy under the complex weather conditions encountered while driving. The frustum association method accurately associates lidar and radar detections with objects, greatly reducing the amount of point cloud to be processed. The spherical voxel convolution is exploited to extract the rotation-invariant feature of the point cloud, supplementing the rotation information of the object. The feature-level fusion network with dynamic parameter adaptation obtains fused features quickly and accurately, improving the detection results and supplementing object attributes. Finally, experiments were performed on the nuScenes dataset, the Radiate dataset, and a real test site. The results indicate that, compared with other algorithms, the proposed algorithm has higher accuracy, richer object information, stronger generalization ability, and better robustness in complex weather conditions. Since the proposed algorithm relies on the 2D detection boxes in the image to generate the frustum, and the image cannot provide sufficient information in extreme conditions (e.g., complete darkness and very dense fog), future work will improve the fusion method to handle such situations.