
1 Introduction

3D object detection is an indispensable component of autonomous-driving (AD) perception systems and robotics. LiDAR-based 3D object detection has drawn particular attention because of the rich depth and geometry information provided by LiDAR point clouds. However, most methods excel primarily on large or densely sampled objects (e.g., cars), while they often struggle to achieve satisfactory performance on small and distant hard cases (e.g., cyclists and pedestrians).

Previous voxel-based methods [2,3,4,5,6,7,8] voxelize point clouds and apply 3D sparse convolutions to the voxels to extract features. However, due to the inherent sparsity and varying density of LiDAR point clouds, detectors that rely on ordinary cubic voxelization suffer from a large number of empty voxels. This results in an incomplete representation of objects and a loss of object-level information. In addition, the imbalanced point distribution across cubic voxels inevitably introduces significant computational overhead.

To address the limitations of cubic voxelization, [9] introduces a cylindrical voxelization approach that partitions the point cloud in a manner aligned with the rotational, radial scanning pattern of LiDAR. Naturally, voxels should be larger in regions where the point cloud becomes sparser. This voxel representation preserves the spatial structure of objects and yields more compact voxel features, and it has been shown to deliver superior performance on outdoor point cloud semantic segmentation. Earlier, LiDAR-based multi-view fusion methods [10, 11] were explored for object detection. These methods concatenate voxel/pillar features from the bird's-eye view and the spherical/cylindrical view, and then propagate the features to points through voxel-point mappings to obtain point-level semantics.

The above methods have the following drawbacks: 1) Traditional LiDAR-based detectors that use only cubic voxels suffer from information loss due to inherent voxelization limitations, resulting in poor detection performance on small objects. 2) Methods [10, 11] that fuse multiple representations of LiDAR point clouds employ a heavy voxel feature encoder (e.g., stacked PointNet) before the 3D backbone, which increases time and memory consumption. Although point-level features can provide fine-grained semantic information, they unavoidably introduce detrimental background noise from the different views.

To address the aforementioned issues, we present a simple yet efficient 3D object detector, termed SVFNeXt, that effectively utilizes the complementary information from LiDAR cross-representation learning through sparse voxel fusion. Our method comprises three parts: Dynamic Distance-aware Cylindrical Voxelization (DDCV), Foreground Centroid-Voxel Selection-Query-Fusion (FCVSQF) and Object-aware Center-Voxel Transformer (OCVT).

Specifically, in the DDCV module we adapt the cylindrical voxelization of [9] with non-uniform distance intervals along the \(\rho \)-axis, so that much larger voxels are generated for distant regions. Furthermore, dynamic voxelization [11] avoids hard-coding the number of points per voxel, maximizing point utilization without dropping any points and hence minimizing information loss. The FCVSQF module uses the centroid of the points within each voxel, instead of the voxel center, as the query source and target, thus preserving the original 3D geometry and representing voxel features accurately. To save memory and avoid introducing background voxel noise, we focus on a few important foreground cubic centroid-voxels for local feature query and fusion in the cylindrical voxels. Additionally, we design a loss function to ensure the sampling of foreground centroid-voxels. The OCVT module further enhances the refined cubic voxel features by capturing long-range, object-level global information with a transformer [12], specifically attending to voxels surrounding the object center.

The three modules work together to produce the final enhanced cubic voxels for compact and accurate detection. Extensive experiments on public benchmarks demonstrate that SVFNeXt significantly boosts detection performance thanks to sparse voxel fusion, especially on small and distant objects. Meanwhile, we also show results comparable to state-of-the-art methods on large objects (e.g., car, vehicle).

2 Related Work

2.1 Voxel-Based 3D Detectors

Mainstream voxel-based methods [2,3,4, 7, 8, 22] typically partition the point cloud into cubic voxels and extract features using sparse convolutions. [2] utilizes more efficient 3D sparse convolutions to accelerate VoxelNet [6]. [7] collapses voxels into pillars along the z-axis and employs 2D convolutions to speed up computation. [3] refines proposals with RoI-grid pooling in a second stage. [4, 8] aggregate voxel features using key points for box refinement. [22] addresses uneven point cloud density by considering the point density within voxels. Although the regular grid structure of cubic voxelization enables efficient feature extraction with CNNs, the receptive field is limited by the convolutional kernel. In contrast, our method enlarges the receptive field indirectly through cross-representation query.

2.2 Fusion-Based 3D Detectors

Fusion-based methods can be broadly categorized into multi-modal and multi-representation fusion. The former combines data from different sensors (e.g., LiDAR and camera) and has been explored by numerous methods [13, 15, 16]. Some [13, 15] encode features from different modalities separately and fuse them at the proposal level or in the BEV feature map, while [16] employs attention mechanisms for feature fusion and alignment. However, feature misalignment and the additional branch may impact efficiency and real-time performance. The latter usually fuses data from the same source (e.g., LiDAR). [10, 11] attempt point-level fusion of different views, but they may introduce noise and have limited impact on the receptive field. In contrast, our approach selectively enhances foreground centroid-voxels with an alternative LiDAR representation to expand the receptive field and leverage complementary information.

2.3 Transformer-Based 3D Detectors

The Transformer [12] has recently demonstrated its superiority in 2D vision tasks. Given the permutation invariance of point clouds, applying transformers to 3D vision is a natural choice. In pioneering works [17,18,19,20,21], attention mechanisms are employed at different stages of the 3D detection pipeline (e.g., the 3D backbone [17,18,19], dense head [20], and RoI head [21, 22]) to learn contextual information. However, directly applying a vanilla transformer to massive point clouds is infeasible in terms of time and space. Therefore, we focus specifically on voxels near the object center to capture long-range dependencies.

3 SVFNeXt for 3D Object Detection

We propose a sparse cross-representation voxel feature fusion and refinement method called SVFNeXt, which integrates voxel-level features during sparse feature extraction. Our objective is to make minimal modifications and provide a simple and efficient plugin that can be easily incorporated into generic detection pipelines, as illustrated in Fig. 1. SVFNeXt consists primarily of three modules: Dynamic Distance-aware Cylindrical Voxelization (Sect. 3.1), Foreground Centroid-Voxel Selection-Query-Fusion (Sect. 3.2), and Object-aware Center-Voxel Transformer (Sect. 3.3).

Fig. 1. A schematic overview of SVFNeXt.

3.1 Dynamic Distance-Aware Cylindrical Voxelization

To preserve the 3D geometric structure of objects and adapt to both the rotational scanning pattern of LiDAR and the varying sparsity of point clouds, we introduce dynamic distance-aware cylindrical voxelization, as shown in Fig. 2. This technique converts points from Cartesian coordinates to cylindrical coordinates and, unlike [9], partitions voxels unevenly along the \(\rho \)-axis without dropping any points.

Given a point cloud \(P_{\textrm{cart}} = \{(x_i, y_i, z_i)\}_{i = 1}^{N_\textrm{p}}\) defined in the Cartesian coordinate system, its cylindrical representation \(P_{\textrm{cyl}} = \{(\rho _i, \varphi _i, z_i)\}_{i=1}^{N_{\textrm{p}}}\) is computed as

$$\begin{aligned} \rho _i = \sqrt{x_i^2 + y_i^2} \qquad \varphi _i = \arctan (\frac{y_i}{x_i}) \qquad z_i = z_i \end{aligned}$$
(1)

where \(N_{\textrm{p}}\) is the number of points in the point cloud.
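For reference, Eq. (1) amounts to a few tensor operations. Below is a minimal PyTorch sketch (the function name is ours, and atan2 is used in place of \(\arctan (y/x)\) for numerical robustness):

```python
import torch

def cartesian_to_cylindrical(points_xyz: torch.Tensor) -> torch.Tensor:
    """Convert (N, 3) Cartesian points (x, y, z) to cylindrical (rho, phi, z), cf. Eq. (1)."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    rho = torch.sqrt(x ** 2 + y ** 2)   # radial distance from the LiDAR origin
    phi = torch.atan2(y, x)             # azimuth; atan2 handles x = 0 and the full [-pi, pi] range
    return torch.stack([rho, phi, z], dim=-1)
```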

Dynamic voxelization [11] assigns points to grid cells dynamically based on their spatial coordinates. For the cylindrical point set \(P_{\textrm{cyl}}\) and voxel set \(V_{\textrm{cyl}}\), voxelization can be described as a bidirectional mapping between points and voxels. Formally,

$$\begin{aligned} V_{\textrm{cyl}} &= \{ v_j \mid \mathcal {M}_{v}(p_i) = v_j, p_i \in P_{\textrm{cyl}}, \forall i\}_{j=1}^{M} \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {M}_p(v_j) &= \{ p_i \mid \forall p_i \in v_j, v_j \in V_{\textrm{cyl}}\} \end{aligned}$$
(3)

where \(M\) is the number of non-empty voxels, \(\mathcal {M}_{v}(\cdot )\) denotes the point-to-voxel mapping, and \(\mathcal {M}_p(\cdot )\) denotes the voxel-to-point mapping.
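The mappings of Eqs. (2)-(3) can be realized without a per-voxel point budget by computing integer grid coordinates and de-duplicating them. A minimal sketch, assuming uniform voxel sizes for brevity (the distance-aware sizes of Eq. (4) are handled in the next sketch):

```python
import torch

def dynamic_voxelize(points_cyl: torch.Tensor, voxel_size, pc_min):
    """Dynamic voxelization: assign every point to a voxel, dropping none.

    Returns the non-empty voxel coordinates (the set V_cyl) and the per-point
    voxel id, which encodes both M_v (point -> voxel) and M_p (voxel -> points).
    """
    voxel_size = torch.as_tensor(voxel_size, dtype=points_cyl.dtype)
    pc_min = torch.as_tensor(pc_min, dtype=points_cyl.dtype)
    coords = torch.floor((points_cyl - pc_min) / voxel_size).long()         # (N, 3) integer grid coords
    voxels, point2voxel = torch.unique(coords, return_inverse=True, dim=0)  # (M, 3), (N,)
    return voxels, point2voxel
```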

Fig. 2. Top-down view of regular (left, \(\varDelta \rho _1 = \varDelta \rho _2 = \varDelta \rho _3\)) vs. distance-aware (right, \(\varDelta \rho _1 < \varDelta \rho _2 < \varDelta \rho _3\)) cylindrical voxelization.

Distance-aware cylindrical voxelization partitions the \(\rho \)-axis into unequal intervals in the cylindrical coordinate system. The farther a region is from the origin (i.e., the LiDAR, O in Fig. 2), the sparser the points and the larger the voxels, allowing more points to reside in each voxel, as shown in Fig. 2 (right). Defining the voxel size as \(V_s = (\varDelta \rho , \varDelta \varphi , \varDelta z)\), we distinguish three cases:

$$\begin{aligned} V_s = {\left\{ \begin{array}{ll} (\varDelta \rho _1, \varDelta \varphi , \varDelta z), &{} 0 \leqslant \rho < \rho _1 \\ (\varDelta \rho _2, \varDelta \varphi , \varDelta z), &{} \rho _1 \leqslant \rho < \rho _2 \\ (\varDelta \rho _3, \varDelta \varphi , \varDelta z), &{} \rho \geqslant \rho _2 \end{array}\right. } \end{aligned}$$
(4)

where \(\varDelta \rho _1 < \varDelta \rho _2 < \varDelta \rho _3\). We refer to \([0, \rho _1)\) as close, \([\rho _1, \rho _2)\) as medium, and \([\rho _2, +\infty )\) as far.
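Eq. (4) can be implemented by first bucketizing \(\rho \) into the close/medium/far bands and then applying the per-band interval. A sketch, assuming the WOD band boundaries and \(\varDelta \rho \) values reported later in Sect. 4.2:

```python
import torch

def distance_aware_rho_index(rho: torch.Tensor,
                             bounds=(30.24, 50.24),
                             d_rho=(0.09, 0.10, 0.15)) -> torch.Tensor:
    """Map radial distances to voxel indices along the rho-axis with unequal intervals (Eq. 4)."""
    band = torch.bucketize(rho, torch.tensor(bounds), right=True)   # 0: close, 1: medium, 2: far
    n_close = round(bounds[0] / d_rho[0])                           # voxels in the close band
    n_medium = round((bounds[1] - bounds[0]) / d_rho[1])            # voxels in the medium band
    offsets = torch.tensor([0, n_close, n_close + n_medium])        # index offset of each band
    starts = torch.tensor([0.0, bounds[0], bounds[1]])
    sizes = torch.tensor(d_rho)
    return offsets[band] + torch.floor((rho - starts[band]) / sizes[band]).long()
```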

3.2 Foreground Centroid-Voxel Selection-Query-Fusion

Various approaches [3, 4, 8, 21] use the voxel center to represent the position of a voxel feature. However, they treat voxels with different point distributions equally, which inevitably misleads the model and overlooks important geometric details. Following the observation of [22], we adopt the voxel centroid as the position representative to achieve accurate feature querying. Moreover, the centroid aligns well with our DDCV module, as it captures the distribution of points within each voxel. We therefore first locate the centroid of each voxel after the initial cubic and cylindrical voxelization.

Let the cylindrical voxel set be \(V_{\textrm{cyl}} = \{v_j = \{I_{\textrm{cyl}}^{v_j}, F_{\textrm{cyl}}^{v_j}\}\}_{j=1}^M\) and the cubic voxel set be \(V_{\textrm{cub}} = \{ u_i = \{I_{\textrm{cub}}^{u_i}, F_{\textrm{cub}}^{u_i}\}\}_{i=1}^N\), where, for each representation, \(I \in \mathbb {R}^3\) is the voxel index, \(F \in \mathbb {R}^{3+\textrm{c}}\) is the corresponding voxel feature, and \(\textrm{c}\) is the number of channels of extra features (e.g., intensity, elongation). Taking \(V_{\textrm{cyl}}\) as an example, the voxel centroid is computed as the average of the spatial coordinates of the points within the voxel. Specifically, for \(v_j \in V_{\textrm{cyl}}\), the centroid is

$$\begin{aligned} \mathcal {C}_{\textrm{cyl}}^j = \frac{1}{\mathcal {N}(v_j)} \sum _{p_j \in v_j} p_j \end{aligned}$$
(5)

where \(p_j = (\rho _j, \varphi _j, z_j)\) and \(\mathcal {N}(v_j)\) is the number of points within voxel \(v_j\). For the cubic voxel set \(V_{\textrm{cub}}\), the voxel centroid \(\mathcal {C}_{\textrm{cub}}^i\) is computed in the same way as Eq. 5.
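Eq. (5) is a scatter-mean over the point-to-voxel mapping. A minimal sketch, reusing the per-point voxel id returned by the dynamic voxelization sketch above:

```python
import torch

def voxel_centroids(points: torch.Tensor, point2voxel: torch.Tensor, num_voxels: int) -> torch.Tensor:
    """Average the coordinates of the points inside each voxel (Eq. 5)."""
    sums = torch.zeros(num_voxels, points.shape[1], dtype=points.dtype)
    sums.index_add_(0, point2voxel, points)                       # per-voxel coordinate sums
    counts = torch.zeros(num_voxels, dtype=points.dtype)
    counts.index_add_(0, point2voxel,
                      torch.ones_like(point2voxel, dtype=points.dtype))  # N(v_j)
    return sums / counts.clamp(min=1).unsqueeze(-1)               # (M, 3) centroids
```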

Fig. 3. The FCVSQF module. Initially, we locate the centroids and retrieve the centroid-voxel features of both representations at scale s. We then query the cylindrical centroid-voxels around the selected foreground cubic centroid-voxels by ball-query and pool their features, yielding \((\textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s})_p\). Finally, we fuse the pooled features with the selected foreground cubic centroid-voxel features \((\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s})_f\) to generate the refined cubic voxel features \((\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_f^r\).

Centroid-Voxel Features Retrieval. After obtaining the voxel centroids of the two voxel representations, we perform Scale and Group operations to obtain centroids \(\mathbf {\mathcal {C}}_*^{s} \in \mathbb {R}^{n_* \times 3}\) and the corresponding voxel (i.e., centroid-voxel) indices \(\textbf{I}^{{\mathcal {C}}_s}_* \in \mathbb {R}^{n_* \times 3}\) for the feature map \(\textbf{F}_*^s \in \mathbb {R}^{N_* \times c_s}\) produced by the 3D sparse CNN at scale s. We then Search the whole sparse feature map \(\textbf{F}_*^s\) for the centroid-voxels based on these indices and retrieve the associated voxel features \(\textbf{F}_{*}^{{\mathcal {C}}_s} \in \mathbb {R}^{n_* \times c_s}\). Here, \(*=\{\textrm{cub, cyl}\}\), \(n_* < N_*\), and \(c_s\) is the number of channels of the voxel features. Formally, given the initial voxel indices \(\textbf{I}_*\), voxel centroids \({\mathbf {\mathcal {C}}}_*\), and downsample factors \(D = \{1, 2, 4, 8\}\) of the feature map \(\textbf{F}_*\) at each scale,

$$\begin{aligned} \textbf{F}_*^{{\mathcal {C}}_s} = \mathcal {S}_2\left( \mathcal {G}\left( \mathcal {S}_1(\textbf{I}_*, D_s), \mathbf {\mathcal {C}}_*\right) , \textbf{F}_*^s\right) \end{aligned}$$
(6)

where \(s \in \{1, 2, 3, 4\}\), \(\mathcal {S}_1\) denotes Scale, \(\mathcal {G}\) denotes Group, \(\mathcal {S}_2\) denotes Search.
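A minimal sketch of Eq. (6), assuming the scale-s sparse feature map is given as a pair of integer voxel indices and a dense feature matrix; it reuses the `voxel_centroids` helper sketched above, and grouping initial-voxel centroids by their downsampled index is used as an approximation of the true scale-s point centroid:

```python
import torch

def retrieve_centroid_voxel_features(init_indices, init_centroids, feat_indices, feats, downsample):
    """Eq. (6): Scale (S1) -> Group (G) -> Search (S2) for one representation at scale s."""
    scaled = torch.div(init_indices, downsample, rounding_mode='floor')   # S1: scale initial indices
    uniq, inverse = torch.unique(scaled, return_inverse=True, dim=0)      # G: one row per centroid-voxel
    cents = voxel_centroids(init_centroids, inverse, uniq.shape[0])       # group centroids per scale-s voxel
    # S2: look up each centroid-voxel index in the scale-s sparse feature map
    lut = {tuple(idx.tolist()): row for row, idx in enumerate(feat_indices)}
    rows = torch.tensor([lut[tuple(idx.tolist())] for idx in uniq])
    return cents, feats[rows]                                             # (n, 3) centroids, (n, c_s) features
```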

Centroid-Voxels Selection-Query-Fusion. To obtain refined cubic centroid-voxel features, it is crucial to select foreground centroid-voxels for feature aggregation. We focus on the important centroid-voxels rather than all of them, thus avoiding background noise from cylindrical centroid-voxels, which offers no benefit to detection. This differs from [17], which uniformly aggregates features from all non-empty voxels. Moreover, our method expands the effective receptive field while maintaining high efficiency.

We follow three steps: foreground cubic centroid-voxel selection, cross-representation query, and fusion. As shown in Fig. 3, with centroid-voxels from both representations involved, we treat the cubic ones as primary, following common practice, and the cylindrical ones as auxiliary. First, we select the top-k cubic centroid-voxels as the query source according to their foreground scores. Then, we perform an MSG ball-query [1] within the cylindrical centroid-voxels based on the associated centroids, which pools cylindrical features within a local range and provides more fine-grained geometric information. Finally, we fuse the pooled cylindrical features with the selected foreground cubic centroid-voxel features to obtain the refined features.

Let the subscripts/superscripts f, p, and r denote foreground, pooled, and refined, respectively, and let \(\mathcal {S}, \mathcal {Q}, \mathcal {F}\) denote Selection, Query, and Fusion. We have \(\{(\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s})_f, (\textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s})_p, (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_f^r\} \in \mathbb {R}^{n_f \times c_s}\) and \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f \in \mathbb {R}^{n_f \times 3}\), where \(n_f\) is the number of selected foreground cubic centroid-voxels. The SQF part illustrated in Fig. 3 can then be formulated as

$$\begin{aligned} \left[ (\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s}) _f, ({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f\right] &= \mathcal {S}\left( \texttt {SubM3d}(\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s}), {\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s} \right) \end{aligned}$$
(7)
$$\begin{aligned} (\textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s})_p &= \texttt {Linear}\left( \mathcal {Q}\left( ({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f, {\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cyl}}}^{s}, \textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s} \right) \right) \end{aligned}$$
(8)
$$\begin{aligned} (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_f^r &= \mathcal {F}\left( (\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s})_f, (\textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s})_p \right) \end{aligned}$$
(9)
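A compact sketch of Eqs. (7)-(9). For brevity, the SubM3d foreground scorer is replaced by a linear head, the MSG ball-query by a single-radius query with max-pooling, and the fusion by concatenation plus a linear layer; all module names and hyperparameters here are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class SelectQueryFuse(nn.Module):
    def __init__(self, c_s: int, top_k: int = 256, radius: float = 2.0):
        super().__init__()
        self.score = nn.Linear(c_s, 1)        # stand-in for the SubM3d scorer in Eq. (7)
        self.proj = nn.Linear(c_s, c_s)       # Linear(.) in Eq. (8)
        self.fuse = nn.Linear(2 * c_s, c_s)   # F(., .) in Eq. (9) as concat + linear
        self.top_k, self.radius = top_k, radius

    def forward(self, f_cub, c_cub, f_cyl, c_cyl):
        # Eq. (7): select the top-k cubic centroid-voxels by foreground score
        fg = torch.sigmoid(self.score(f_cub)).squeeze(-1)
        top = fg.topk(min(self.top_k, f_cub.shape[0])).indices
        f_fg, c_fg = f_cub[top], c_cub[top]
        # Eq. (8): ball-query the cylindrical centroid-voxels around each selected centroid, then pool
        within = torch.cdist(c_fg, c_cyl) < self.radius                  # (k, M) neighbourhood mask
        pooled = torch.stack([f_cyl[m].max(0).values if m.any()
                              else torch.zeros_like(f_cyl[0]) for m in within])
        pooled = self.proj(pooled)
        # Eq. (9): fuse pooled cylindrical features with the selected cubic features
        return self.fuse(torch.cat([f_fg, pooled], dim=-1)), c_fg, top
```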

3.3 Object-Aware Center-Voxel Transformer

So far, we have obtained refined foreground cubic centroid-voxel features, which we mark as the features of interest to attend to. They incorporate fine-grained features from the more informative cylindrical representation, partially compensating for the loss of object information in cubic voxels. However, because features are aggregated independently, they may lack interaction. Furthermore, incorporating global information into the features is essential for detecting small and distant objects. We therefore propose OCVT, guided by the object center, to effectively capture long-range context at the object level, as shown in Fig. 4.

3D Sparse Heatmap Generation. Leveraging the selected foreground centroids \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f\) from the FCVSQF module, for each annotated bounding box \(B_k\) centered at \((x_k, y_k, z_k)\) we compute the distance between the center of \(B_k\) and each centroid \((\hat{x}_i^k, \hat{y}_i^k, \hat{z}_i^k)\) of \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f\) that lies inside \(B_k\). A 3D Gaussian kernel is then applied to confine the heatmap response to the range [0, 1]. Formally,

$$\begin{aligned} \hat{H}_i = \exp \left( -\frac{ (x_k - \hat{x}_i^k)^2 + (y_k - \hat{y}_i^k)^2 + (z_k - \hat{z}_i^k)^2 }{2\sigma _k^2}\right) \in [0, 1] \end{aligned}$$
(10)

where \(\sigma _k\) is an object size-adaptive standard deviation [24] and \(\hat{H}_i\) is the heatmap value generated at centroid i. Collecting all centroids yields the final target 3D sparse heatmap \(\hat{\textbf{H}}\), on which a loss is computed against the predicted heatmap \(\textbf{H}\). This allows us to carefully choose the centroid-voxels that are most closely aligned with the object center.
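A sketch of the target heatmap of Eq. (10). For simplicity, every centroid here receives the response of its nearest box rather than being restricted to centroids inside the box, and the size-adaptive \(\sigma _k\) of [24] is passed in as a plain argument:

```python
import torch

def sparse_heatmap_target(centroids: torch.Tensor, box_centers: torch.Tensor,
                          sigmas: torch.Tensor) -> torch.Tensor:
    """Eq. (10): Gaussian response of each centroid w.r.t. every box center, merged by max."""
    d2 = torch.cdist(centroids, box_centers) ** 2          # (n_f, K) squared 3D distances
    h = torch.exp(-d2 / (2 * sigmas.view(1, -1) ** 2))     # per-box Gaussian responses in (0, 1]
    return h.max(dim=1).values                             # target heatmap value per centroid
```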

Fig. 4. The OCVT module. We first generate a 3D sparse heatmap based on the foreground centroid-voxels, then sample center-voxels around the object center according to the heatmap values, and finally perceive long-range object-level context with a transformer encoder.

Center-Voxel Transformer. To build object-level contextual dependencies efficiently, we focus solely on a subset of centroid-voxels closest to the object center. Similar to the Selection step in Fig. 3, the top-K voxels ranked by the predicted heatmap scores are taken as the center-voxels. We denote the center-voxel features as \((\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_{ctr}\) and their centroids as \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_{ctr}\); they are then fed into a Transformer encoder block:

$$\begin{aligned} (\textbf{F}_{\scriptscriptstyle \textrm{cub}}^s)_{ctr} = \mathcal {T}(\textbf{Q}, \textbf{K}, \textbf{V}) \end{aligned}$$
(11)
$$\begin{aligned} \textbf{Q} = \textbf{W}_q (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_{ctr} + \textbf{E}_{pos}, \textbf{K} = \textbf{W}_k (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_{ctr} + \textbf{E}_{pos}, \textbf{V} = \textbf{W}_v (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_{ctr} \end{aligned}$$
(12)

where \(\mathcal {T}\) denotes the Transformer, \(\textbf{Q}\), \(\textbf{K}\), \(\textbf{V}\) are the query, key, and value features, and \(\textbf{E}_{pos}\) is the positional embedding obtained by applying a linear layer to \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_{ctr}\).
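A minimal sketch of Eqs. (11)-(12) on top of the standard nn.TransformerEncoder. Note that the standard encoder adds the positional embedding to the value path as well, whereas Eq. (12) adds it only to Q and K; the single-layer, four-head setting is our assumption:

```python
import torch
import torch.nn as nn

class CenterVoxelTransformer(nn.Module):
    def __init__(self, c_s: int, n_heads: int = 4, n_layers: int = 1):
        super().__init__()
        self.pos_embed = nn.Linear(3, c_s)     # E_pos: linear map of the center centroids
        layer = nn.TransformerEncoderLayer(d_model=c_s, nhead=n_heads,
                                           dim_feedforward=2 * c_s, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f_ctr: torch.Tensor, c_ctr: torch.Tensor) -> torch.Tensor:
        # f_ctr: (K, c_s) center-voxel features; c_ctr: (K, 3) their centroids
        x = f_ctr + self.pos_embed(c_ctr)                   # add positional embedding (Eq. 12)
        return self.encoder(x.unsqueeze(0)).squeeze(0)      # self-attention over center-voxels (Eq. 11)
```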

Finally, we scatter \((\textbf{F}_{\scriptscriptstyle \textrm{cub}}^s)_{ctr}\) back to the 3D sparse feature map at scale s, resulting in the enhanced cubic voxel features. The final enhanced features combine rich local features from cylindrical voxels with long-range global contextual dependencies around object centers.

3.4 Loss Functions

The overall loss function comprises four components: the foreground loss and heatmap loss from the 3D backbone, the RPN loss, and the RCNN loss (for two-stage models). We follow [3, 4] for the RPN loss \(\mathcal {L}_{\textrm{rpn}}\) and RCNN loss \(\mathcal {L}_{\textrm{rcnn}}\). For the FCVSQF module, the foreground loss \(\mathcal {L}_{\textrm{fore}}\) is a focal loss [26] with BCE. For the OCVT module, the sparse heatmap loss \(\mathcal {L}_{\textrm{hm}}\) is a smooth-L1 loss. The final loss is the weighted sum of the four parts: \(\mathcal {L} = w_1 \mathcal {L}_{\textrm{fore}} + w_2 \mathcal {L}_{\textrm{hm}} + w_3 \mathcal {L}_{\textrm{rpn}} + w_4 \mathcal {L}_{\textrm{rcnn}}\), with all weights set to 1 in our experiments.
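A sketch of how the objective can be assembled; the focal-loss hyperparameters \(\alpha \) and \(\gamma \) are the common defaults from [26] and are assumptions on our part:

```python
import torch
import torch.nn.functional as F

def focal_bce(logits, target, alpha: float = 0.25, gamma: float = 2.0):
    """Foreground loss L_fore: binary focal loss [26] built on BCE."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def total_loss(pred_hm, target_hm, l_fore, l_rpn, l_rcnn, w=(1.0, 1.0, 1.0, 1.0)):
    """L = w1*L_fore + w2*L_hm + w3*L_rpn + w4*L_rcnn, with all weights set to 1."""
    l_hm = F.smooth_l1_loss(pred_hm, target_hm)     # sparse heatmap loss L_hm
    return w[0] * l_fore + w[1] * l_hm + w[2] * l_rpn + w[3] * l_rcnn
```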

4 Experiments

4.1 Datasets

KITTI. The KITTI dataset contains 7481 training samples and 7518 testing samples. Typically, the training data are divided into a train set with 3712 samples and a val set with 3769 samples. Average precision (AP) at the easy, moderate, and hard difficulty levels is used as the evaluation metric.

Waymo Open Dataset. The WOD dataset consists of 798 sequences for training and 202 sequences for validation. The evaluation metrics include average precision (AP) and average precision weighted by heading (APH). We report the results on both LEVEL 1 (L1) and LEVEL 2 (L2) difficulty levels.

4.2 Implementation Details

For the cubic voxelization, we follow the PV-RCNN [4] settings (i.e., voxel size and point cloud range) on both datasets. For the cylindrical voxelization on KITTI, the ranges are [0, 80] m, [\(-\pi \)/2, \(\pi \)/2] rad, and [−3, 1] m along the \(\rho \), \(\varphi \), and z axes, respectively, with a voxel size of (0.05 m, \(\pi \)/180 rad, 0.1 m). On WOD, the ranges are [0, 107.84] m, [\(-\pi \), \(\pi \)] rad, and [−2, 4] m along the \(\rho \), \(\varphi \), and z axes, and the voxel size is (\(\varDelta \rho \), \(\pi \)/360 rad, 0.15 m), where \(\varDelta \rho \) varies as described for the DDCV module: the distance bands are [0, 30.24) m, [30.24, 50.24) m, and [50.24, 107.84] m, with \(\varDelta \rho \) set to 0.09 m, 0.10 m, and 0.15 m, respectively.
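For quick reference, the cylindrical voxelization settings above can be collected into a small config block; the numeric values are taken from the text, while the dictionary layout itself is our own:

```python
import math

CYLINDRICAL_VOXELIZATION = {
    'kitti': {
        'rho_range': [0.0, 80.0], 'phi_range': [-math.pi / 2, math.pi / 2], 'z_range': [-3.0, 1.0],
        'voxel_size': (0.05, math.pi / 180, 0.1),      # (d_rho [m], d_phi [rad], d_z [m])
    },
    'waymo': {
        'rho_range': [0.0, 107.84], 'phi_range': [-math.pi, math.pi], 'z_range': [-2.0, 4.0],
        'voxel_size': (None, math.pi / 360, 0.15),     # d_rho is distance-aware (DDCV)
        'rho_bands': [(0.0, 30.24, 0.09),              # (start [m], end [m], d_rho [m]): close
                      (30.24, 50.24, 0.10),            # medium
                      (50.24, 107.84, 0.15)],          # far
    },
}
```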

4.3 Main Results

KITTI. With [2,3,4,5] as our baselines, the experimental results on the val and test sets are presented in Table 1 and Table 2, respectively. Our model exhibits notable improvements in both 3D and BEV mAP (e.g., 1.17%, 1.67%, and 3.02% 3D AP at the Mod. level). Notably, our approach significantly enhances performance for the Ped. and Cyc. categories at the moderate difficulty level while maintaining strong performance on the Car class. The visual comparison in Fig. 5 shows that our method better detects, aligns, and orients objects. Our method also achieves competitive results on the test set, further verifying its effectiveness.

Table 1. Performance comparison of 3D/BEV detection with AP R40 on KITTI val set. \(\dagger \): re-implemented by [27]. \(\ddagger \): reported by [27]. SVF: Sparse Voxel Fusion.
Table 2. Performance comparison with different models on the KITTI test set for Car and Cyclist. The top-2 best performances are highlighted in bold.
Table 3. Performance comparison of 3D detection on WOD val set, training with 20% train set. \(\ddagger \): reported by [27]. SVF: Sparse Voxel Fusion.
Table 4. The detection results on WOD val set, training with full train set. \(\dagger \): re-implemented by [27] with kernel size as 3 in 3D backbone.
Table 5. Effect of each component on the WOD val set with [2] as the baseline, training with 20% of the train set. DRCV: regular cylindrical voxelization in Fig. 2 (left).

WOD. We conduct experiments on the large-scale WOD and report the results in Table 3 and Table 4 on the val set. As shown in Table 3, our method consistently improves performance across all categories, similar to what we observe in KITTI. Notably, our method significantly outperforms baselines on mAPH (L2), with margins of 3.55%, 2.15%, 1.56%, and 2.63%, particularly for small objects. Furthermore, we summarize the comparison between our approach and state-of-the-art methods in Table 4.

4.4 Ablation Study

We conduct ablation studies on each proposed module, as shown in Table 5. The combined effect of the three modules yields a significant gain of 3.6% mAPH (L2) on both the overall and the far range (i.e., 50 m-inf).

DDCV. We observe that distance-aware voxelization (DDCV) outperforms regular voxelization (DRCV). The former shows improvements of 1.5%, 1.5% and 2.1% overall APH (L2) for Veh., Ped., and Cyc., respectively. This confirms the capability of cylindrical voxels to provide richer information and refine object representation.

FCVSQF. It utilizes foreground centroids for feature fusion, preserving original geometric shape information. This optimization helps refine foreground voxel features and expands the receptive field through local query, particularly benefiting sparse, small and distant objects. Notably, it brings a performance gain of 1.5% mAPH (L2) on far range.

OCVT. Guided by object center, it models long-range contextual dependencies at object level with center-voxels, further refining the sparse voxel features. This brings a slight performance gain of 0.7% and 0.9% mAPH (L2) on overall and far range, respectively.

Fig. 5. A visual comparison of SVFNeXt vs. PV-RCNN on the KITTI val set. Ground-truth boxes and detection boxes are shown in different colors.

5 Conclusion

In this paper, we propose SVFNeXt, a plug-and-play fusion-based 3D backbone that can be applied to most voxel-based 3D detectors. Taking a rarely explored direction, we address the limitations of conventional cubic voxels by leveraging cylindrical voxels with a more uniform point distribution, providing richer information for accurate detection. Our centroid-based cross-voxel query and local feature fusion partially alleviate the incomplete object representation of cubic voxels, incorporating fine-grained features and enlarging the receptive field. Moreover, object-level global information learning further refines the feature representations, benefiting the subsequent detection. Extensive experiments on public benchmarks provide compelling evidence of our model's efficacy.

Note that our approach falls slightly short of state-of-the-art methods on larger objects in some cases. Future work will explore model generalization to narrow this gap and further enhance performance.