
1 Introduction

3D object detection is an indispensable component of autonomous-driving (AD) perception systems and robotics. LiDAR-based 3D object detection has drawn particular attention because of the rich depth and geometry information provided by LiDAR point clouds. However, most methods excel primarily on large or densely sampled objects (e.g., cars), while they often struggle to achieve satisfactory performance on small and distant hard cases (e.g., cyclists and pedestrians).

Previous voxel-based methods [2,3,4,5,6,7,8] voxelize point clouds and apply 3D sparse convolutions to the voxels to extract features. However, due to the inherent sparsity and varying density of LiDAR point clouds, detectors that rely on ordinary cubic voxelization suffer from a large number of empty voxels. This results in an incomplete representation of objects and a loss of object-level information. In addition, the imbalanced point distribution across cubic voxels inevitably introduces significant computational overhead.

To address the limitations of cubic voxelization, [9] introduces a cylindrical voxelization approach that partitions the point cloud in a manner aligned with the rotational, radial scanning pattern of LiDAR. Naturally, voxels should be larger in regions where the point cloud becomes sparser. This voxel representation preserves the spatial structure of objects and yields more compact voxel features, and it has been shown to deliver superior performance on outdoor point cloud semantic segmentation. Earlier, LiDAR-based multi-view fusion methods [10, 11] were explored for object detection. These methods concatenate voxel/pillar features from the bird's-eye view and the spherical/cylindrical view, and then propagate the features to points through voxel-point mappings to obtain point-level semantics.

The above methods have the following drawbacks: 1) Traditional LiDAR-based detectors that use only cubic voxels suffer from information loss due to inherent voxelization limitations, resulting in poor detection performance on small objects. 2) Methods [10, 11] that fuse multiple representations of LiDAR point clouds employ a heavy voxel feature encoder (e.g., stacked PointNet) before the 3D backbone, which increases time and memory consumption. Although point-level features can provide fine-grained semantic information, they unavoidably introduce detrimental background noise from the different views.

To address the aforementioned issues, we present a simple yet efficient 3D object detector, termed SVFNeXt, that effectively utilizes the complementary information from LiDAR cross-representation learning through sparse voxel fusion. Our method comprises three parts: Dynamic Distance-aware Cylindrical Voxelization (DDCV), Foreground Centroid-Voxel Selection-Query-Fusion (FCVSQF) and Object-aware Center-Voxel Transformer (OCVT).

Specifically, in the DDCV module we adapt the cylindrical voxelization of [9] with non-uniform distance intervals along the \(\rho \)-axis, so that much larger voxels are generated for distant regions. Furthermore, dynamic voxelization [11] avoids hard-coding the number of points per voxel, maximizing point utilization without dropping any points and hence minimizing information loss. The FCVSQF module uses the centroid of the points within each voxel, instead of the voxel center, as the query source and target, thus preserving the original 3D geometry and representing voxel features accurately. To save memory and avoid introducing background voxel noise, we focus on a few important foreground cubic centroid-voxels for local feature query and fusion in the cylindrical voxels. Additionally, we design a loss function to ensure the sampling of foreground centroid-voxels. The OCVT module further enhances the refined cubic voxel features by capturing long-range, object-level global information with a transformer [12], specifically attending to voxels surrounding the object center.

The three modules work together to produce the final enhanced cubic voxels for compact and accurate detection. Extensive experiments on public benchmarks demonstrate that SVFNeXt significantly boosts detection performance thanks to sparse voxel fusion, especially on small and distant objects. Meanwhile, we also show results comparable to state-of-the-art methods on large objects (e.g., car, vehicle).

2 Related Work

2.1 Voxel-Based 3D Detectors

Mainstream voxel-based methods [2,3,4, 7, 8, 22] typically partition the point cloud into cubic voxels and extract features using sparse convolutions. [2] utilizes more efficient 3D sparse convolutions to accelerate VoxelNet [6]. [7] collapses voxels into pillars along the z-axis and employs 2D convolutions to speed up computation. [3] refines proposals with RoI-grid pooling in a second stage. [4, 8] aggregate voxel features using key points for box refinement. [22] addresses uneven point cloud density by considering the point density within voxels. Although the regular grid structure of cubic voxelization enables efficient feature extraction with CNNs, the receptive field is limited by the convolutional kernel. In contrast, our method enlarges the receptive field indirectly through cross-representation query.

2.2 Fusion-Based 3D Detectors

Fusion-based methods can be broadly categorized into multi-modal and multi-representation fusion. The former combines data from different sensors (e.g., LiDAR and camera) and has been explored by numerous methods [13, 15, 16]. Some [13, 15] encode features from different modalities separately and fuse them at the proposal level or in the BEV feature map, while [16] employs attention mechanisms for feature fusion and alignment. However, feature misalignment and the additional branch may impact efficiency and real-time performance. The latter usually fuses data from the same source (e.g., LiDAR). [10, 11] attempt point-level fusion of different views, but they may introduce noise and have limited impact on the receptive field. In contrast, our approach selectively enhances foreground centroid-voxels with an alternative LiDAR representation to expand the receptive field and leverage complementary information.

2.3 Transformer-Based 3D Detectors

The Transformer [12] has recently demonstrated its superiority in 2D vision tasks. Given the permutation invariance of point clouds, applying transformers to 3D vision is a natural choice. In pioneering works [17,18,19,20,21], attention mechanisms are employed at different stages of the 3D detection pipeline (e.g., the 3D backbone [17,18,19], dense head [20], and RoI head [21, 22]) to learn contextual information. However, directly applying a vanilla transformer to massive point clouds is infeasible in terms of time and space. Therefore, we focus specifically on voxels near the object center to capture long-range dependencies.

3 SVFNeXt for 3D Object Detection

We propose a sparse cross-representation voxel feature fusion and refinement method called SVFNeXt, which integrates voxel-level features during sparse feature extraction. Our objective is to make minimal modifications and provide a simple and efficient plugin that can be easily incorporated into generic detection pipelines, as illustrated in Fig. 1. SVFNeXt consists primarily of three modules: Dynamic Distance-aware Cylindrical Voxelization (Sect. 3.1), Foreground Centroid-Voxel Selection-Query-Fusion (Sect. 3.2), and Object-aware Center-Voxel Transformer (Sect. 3.3).

Fig. 1. A schematic overview of SVFNeXt.

3.1 Dynamic Distance-Aware Cylindrical Voxelization

To preserve the 3D geometric structure of objects and adapt to both the rotational scanning pattern of LiDAR and the varying sparsity of point clouds, we introduce dynamic distance-aware cylindrical voxelization, as shown in Fig. 2. This technique converts points from Cartesian coordinates to cylindrical coordinates and, unlike [9], partitions voxels unevenly along the \(\rho \)-axis without dropping any points.

Given a point cloud \(P_{\textrm{cart}} = \{(x_i, y_i, z_i)\}_{i = 1}^{N_\textrm{p}}\) defined in the Cartesian coordinate system, its cylindrical representation \(P_{\textrm{cyl}} = \{(\rho _i, \varphi _i, z_i)\}_{i=1}^{N_{\textrm{p}}}\) is computed as

$$\begin{aligned} \rho _i = \sqrt{x_i^2 + y_i^2} \qquad \varphi _i = \arctan (\frac{y_i}{x_i}) \qquad z_i = z_i \end{aligned}$$
(1)

where \(N_{\textrm{p}}\) is the number of points in the point cloud.
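For reference, Eq. (1) amounts to a few tensor operations. Below is a minimal PyTorch sketch (the function name is ours, and atan2 is used in place of \(\arctan (y/x)\) for numerical robustness):

```python
import torch

def cartesian_to_cylindrical(points_xyz: torch.Tensor) -> torch.Tensor:
    """Convert (N, 3) Cartesian points (x, y, z) to cylindrical (rho, phi, z), cf. Eq. (1)."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    rho = torch.sqrt(x ** 2 + y ** 2)   # radial distance from the LiDAR origin
    phi = torch.atan2(y, x)             # azimuth; atan2 handles x = 0 and the full [-pi, pi] range
    return torch.stack([rho, phi, z], dim=-1)
```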

Dynamic voxelization [11] assigns points to grid cells dynamically based on their spatial coordinates. For the cylindrical point set \(P_{\textrm{cyl}}\) and voxel set \(V_{\textrm{cyl}}\), voxelization can be described as a bidirectional mapping between points and voxels. Formally,

$$\begin{aligned} V_{\textrm{cyl}} &= \{ v_j \mid \mathcal {M}_{v}(p_i) = v_j, p_i \in P_{\textrm{cyl}}, \forall i\}_{j=1}^{M} \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {M}_p(v_j) &= \{ p_i \mid \forall p_i \in v_j, v_j \in V_{\textrm{cyl}}\} \end{aligned}$$
(3)

where \(M\) is the number of non-empty voxels, \(\mathcal {M}_{v}(\cdot )\) denotes the point-to-voxel mapping, and \(\mathcal {M}_p(\cdot )\) denotes the voxel-to-point mapping.
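The mappings of Eqs. (2)-(3) can be realized without a per-voxel point budget by computing integer grid coordinates and de-duplicating them. A minimal sketch, assuming uniform voxel sizes for brevity (the distance-aware sizes of Eq. (4) are handled in the next sketch):

```python
import torch

def dynamic_voxelize(points_cyl: torch.Tensor, voxel_size, pc_min):
    """Dynamic voxelization: assign every point to a voxel, dropping none.

    Returns the non-empty voxel coordinates (the set V_cyl) and the per-point
    voxel id, which encodes both M_v (point -> voxel) and M_p (voxel -> points).
    """
    voxel_size = torch.as_tensor(voxel_size, dtype=points_cyl.dtype)
    pc_min = torch.as_tensor(pc_min, dtype=points_cyl.dtype)
    coords = torch.floor((points_cyl - pc_min) / voxel_size).long()         # (N, 3) integer grid coords
    voxels, point2voxel = torch.unique(coords, return_inverse=True, dim=0)  # (M, 3), (N,)
    return voxels, point2voxel
```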

Fig. 2. Top-down view of regular (left, \(\varDelta \rho _1 = \varDelta \rho _2 = \varDelta \rho _3\)) vs. distance-aware (right, \(\varDelta \rho _1 < \varDelta \rho _2 < \varDelta \rho _3\)) cylindrical voxelization.

Distance-aware cylindrical voxelization partitions the \(\rho \)-axis into unequal intervals in the cylindrical coordinate system. The farther a region is from the origin (i.e., the LiDAR, O in Fig. 2), the sparser the points and the larger the voxels, allowing more points to reside in each voxel, as shown in Fig. 2 (right). Defining the voxel size as \(V_s = (\varDelta \rho , \varDelta \varphi , \varDelta z)\), we distinguish three cases:

$$\begin{aligned} V_s = {\left\{ \begin{array}{ll} (\varDelta \rho _1, \varDelta \varphi , \varDelta z), &{} 0 \leqslant \rho < \rho _1 \\ (\varDelta \rho _2, \varDelta \varphi , \varDelta z), &{} \rho _1 \leqslant \rho < \rho _2 \\ (\varDelta \rho _3, \varDelta \varphi , \varDelta z), &{} \rho \geqslant \rho _2 \end{array}\right. } \end{aligned}$$
(4)

where \(\varDelta \rho _1 < \varDelta \rho _2 < \varDelta \rho _3\). We refer to \([0, \rho _1)\) as close, \([\rho _1, \rho _2)\) as medium, and \([\rho _2, +\infty )\) as far.
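Eq. (4) can be implemented by first bucketizing \(\rho \) into the close/medium/far bands and then applying the per-band interval. A sketch, assuming the WOD band boundaries and \(\varDelta \rho \) values reported later in Sect. 4.2:

```python
import torch

def distance_aware_rho_index(rho: torch.Tensor,
                             bounds=(30.24, 50.24),
                             d_rho=(0.09, 0.10, 0.15)) -> torch.Tensor:
    """Map radial distances to voxel indices along the rho-axis with unequal intervals (Eq. 4)."""
    band = torch.bucketize(rho, torch.tensor(bounds), right=True)   # 0: close, 1: medium, 2: far
    n_close = round(bounds[0] / d_rho[0])                           # voxels in the close band
    n_medium = round((bounds[1] - bounds[0]) / d_rho[1])            # voxels in the medium band
    offsets = torch.tensor([0, n_close, n_close + n_medium])        # index offset of each band
    starts = torch.tensor([0.0, bounds[0], bounds[1]])
    sizes = torch.tensor(d_rho)
    return offsets[band] + torch.floor((rho - starts[band]) / sizes[band]).long()
```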

3.2 Foreground Centroid-Voxel Selection-Query-Fusion

Various approaches [3, 4, 8, 21] use the voxel center to represent the position of a voxel feature. However, they treat voxels with different point distributions equally, which inevitably misleads the model and overlooks important geometric details. Following the observation of [22], we adopt the voxel centroid as the position representative to achieve accurate feature querying. Moreover, the centroid aligns well with our DDCV module, as it captures the distribution of points within each voxel. We therefore first locate the centroid of each voxel after the initial cubic and cylindrical voxelization.

Let the cylindrical voxel set be \(V_{\textrm{cyl}} = \{v_j = \{I_{\textrm{cyl}}^{v_j}, F_{\textrm{cyl}}^{v_j}\}\}_{j=1}^M\) and the cubic voxel set be \(V_{\textrm{cub}} = \{ u_i = \{I_{\textrm{cub}}^{u_i}, F_{\textrm{cub}}^{u_i}\}\}_{i=1}^N\), where, for each representation, \(I \in \mathbb {R}^3\) is the voxel index, \(F \in \mathbb {R}^{3+\textrm{c}}\) is the corresponding voxel feature, and \(\textrm{c}\) is the number of channels of extra features (e.g., intensity, elongation). Taking \(V_{\textrm{cyl}}\) as an example, the voxel centroid is computed as the average of the spatial coordinates of the points within the voxel. Specifically, for \(v_j \in V_{\textrm{cyl}}\), the centroid is

$$\begin{aligned} \mathcal {C}_{\textrm{cyl}}^j = \frac{1}{\mathcal {N}(v_j)} \sum _{p_j \in v_j} p_j \end{aligned}$$
(5)

where \(p_j = (\rho _j, \varphi _j, z_j)\) and \(\mathcal {N}(v_j)\) is the number of points within voxel \(v_j\). For the cubic voxel set \(V_{\textrm{cub}}\), the voxel centroid \(\mathcal {C}_{\textrm{cub}}^i\) is computed in the same way as Eq. 5.
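Eq. (5) is a scatter-mean over the point-to-voxel mapping. A minimal sketch, reusing the per-point voxel id returned by the dynamic voxelization sketch above:

```python
import torch

def voxel_centroids(points: torch.Tensor, point2voxel: torch.Tensor, num_voxels: int) -> torch.Tensor:
    """Average the coordinates of the points inside each voxel (Eq. 5)."""
    sums = torch.zeros(num_voxels, points.shape[1], dtype=points.dtype)
    sums.index_add_(0, point2voxel, points)                       # per-voxel coordinate sums
    counts = torch.zeros(num_voxels, dtype=points.dtype)
    counts.index_add_(0, point2voxel,
                      torch.ones_like(point2voxel, dtype=points.dtype))  # N(v_j)
    return sums / counts.clamp(min=1).unsqueeze(-1)               # (M, 3) centroids
```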

Fig. 3. The FCVSQF module. Initially, we locate the centroids and retrieve the centroid-voxel features of both representations at scale s. We then query the cylindrical centroid-voxels around the selected foreground cubic centroid-voxels by ball-query and pool their features, yielding \((\textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s})_p\). Finally, we fuse the pooled features with the selected foreground cubic centroid-voxel features \((\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s})_f\) to generate the refined cubic voxel features \((\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_f^r\).

Centroid-Voxel Features Retrieval. After obtaining the voxel centroids of the two voxel representations, we perform Scale and Group operations to obtain centroids \(\mathbf {\mathcal {C}}_*^{s} \in \mathbb {R}^{n_* \times 3}\) and the corresponding voxel (i.e., centroid-voxel) indices \(\textbf{I}^{{\mathcal {C}}_s}_* \in \mathbb {R}^{n_* \times 3}\) for the feature map \(\textbf{F}_*^s \in \mathbb {R}^{N_* \times c_s}\) produced by the 3D sparse CNN at scale s. We then Search the whole sparse feature map \(\textbf{F}_*^s\) for the centroid-voxels based on these indices and retrieve the associated voxel features \(\textbf{F}_{*}^{{\mathcal {C}}_s} \in \mathbb {R}^{n_* \times c_s}\). Here, \(*=\{\textrm{cub, cyl}\}\), \(n_* < N_*\), and \(c_s\) is the number of channels of the voxel features. Formally, given the initial voxel indices \(\textbf{I}_*\), voxel centroids \({\mathbf {\mathcal {C}}}_*\), and downsample factors \(D = \{1, 2, 4, 8\}\) of the feature map \(\textbf{F}_*\) at each scale,

$$\begin{aligned} \textbf{F}_*^{{\mathcal {C}}_s} = \mathcal {S}_2\left( \mathcal {G}\left( \mathcal {S}_1(\textbf{I}_*, D_s), \mathbf {\mathcal {C}}_*\right) , \textbf{F}_*^s\right) \end{aligned}$$
(6)

where \(s \in \{1, 2, 3, 4\}\), \(\mathcal {S}_1\) denotes Scale, \(\mathcal {G}\) denotes Group, \(\mathcal {S}_2\) denotes Search.
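A minimal sketch of Eq. (6), assuming the scale-s sparse feature map is given as a pair of integer voxel indices and a dense feature matrix; it reuses the `voxel_centroids` helper sketched above, and grouping initial-voxel centroids by their downsampled index is used as an approximation of the true scale-s point centroid:

```python
import torch

def retrieve_centroid_voxel_features(init_indices, init_centroids, feat_indices, feats, downsample):
    """Eq. (6): Scale (S1) -> Group (G) -> Search (S2) for one representation at scale s."""
    scaled = torch.div(init_indices, downsample, rounding_mode='floor')   # S1: scale initial indices
    uniq, inverse = torch.unique(scaled, return_inverse=True, dim=0)      # G: one row per centroid-voxel
    cents = voxel_centroids(init_centroids, inverse, uniq.shape[0])       # group centroids per scale-s voxel
    # S2: look up each centroid-voxel index in the scale-s sparse feature map
    lut = {tuple(idx.tolist()): row for row, idx in enumerate(feat_indices)}
    rows = torch.tensor([lut[tuple(idx.tolist())] for idx in uniq])
    return cents, feats[rows]                                             # (n, 3) centroids, (n, c_s) features
```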

Centroid-Voxels Selection-Query-Fusion. To obtain refined cubic centroid-voxel features, it is crucial to select foreground centroid-voxels for feature aggregation. We focus on the important centroid-voxels rather than all of them, thus avoiding background noise from cylindrical centroid-voxels, which offers no benefit to detection. This differs from [17], which uniformly aggregates features from all non-empty voxels. Moreover, our method expands the effective receptive field while maintaining high efficiency.

We follow three steps: foreground cubic centroid-voxel selection, cross-representation query, and fusion. As shown in Fig. 3, with centroid-voxels from both representations involved, we treat the cubic ones as primary, following common practice, and the cylindrical ones as auxiliary. First, we select the top-k cubic centroid-voxels as the query source according to their foreground scores. Then, we perform an MSG ball-query [1] within the cylindrical centroid-voxels based on the associated centroids, which pools cylindrical features within a local range and provides more fine-grained geometric information. Finally, we fuse the pooled cylindrical features with the selected foreground cubic centroid-voxel features to obtain the refined features.

Let the subscripts/superscripts f, p, and r denote foreground, pooled, and refined, respectively, and let \(\mathcal {S}, \mathcal {Q}, \mathcal {F}\) denote Selection, Query, and Fusion. We have \(\{(\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s})_f, (\textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s})_p, (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_f^r\} \in \mathbb {R}^{n_f \times c_s}\) and \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f \in \mathbb {R}^{n_f \times 3}\), where \(n_f\) is the number of selected foreground cubic centroid-voxels. The SQF part illustrated in Fig. 3 can then be formulated as

$$\begin{aligned} \left[ (\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s}) _f, ({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f\right] &= \mathcal {S}\left( \texttt {SubM3d}(\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s}), {\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s} \right) \end{aligned}$$
(7)
$$\begin{aligned} (\textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s})_p &= \texttt {Linear}\left( \mathcal {Q}\left( ({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f, {\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cyl}}}^{s}, \textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s} \right) \right) \end{aligned}$$
(8)
$$\begin{aligned} (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_f^r &= \mathcal {F}\left( (\textbf{F}_{\scriptscriptstyle {\textrm{cub}}}^{\mathcal {C}_s})_f, (\textbf{F}_{\scriptscriptstyle {{\textrm{cyl}}}}^{\mathcal {C}_s})_p \right) \end{aligned}$$
(9)
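A compact sketch of Eqs. (7)-(9). For brevity, the SubM3d foreground scorer is replaced by a linear head, the MSG ball-query by a single-radius query with max-pooling, and the fusion by concatenation plus a linear layer; all module names and hyperparameters here are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class SelectQueryFuse(nn.Module):
    def __init__(self, c_s: int, top_k: int = 256, radius: float = 2.0):
        super().__init__()
        self.score = nn.Linear(c_s, 1)        # stand-in for the SubM3d scorer in Eq. (7)
        self.proj = nn.Linear(c_s, c_s)       # Linear(.) in Eq. (8)
        self.fuse = nn.Linear(2 * c_s, c_s)   # F(., .) in Eq. (9) as concat + linear
        self.top_k, self.radius = top_k, radius

    def forward(self, f_cub, c_cub, f_cyl, c_cyl):
        # Eq. (7): select the top-k cubic centroid-voxels by foreground score
        fg = torch.sigmoid(self.score(f_cub)).squeeze(-1)
        top = fg.topk(min(self.top_k, f_cub.shape[0])).indices
        f_fg, c_fg = f_cub[top], c_cub[top]
        # Eq. (8): ball-query the cylindrical centroid-voxels around each selected centroid, then pool
        within = torch.cdist(c_fg, c_cyl) < self.radius                  # (k, M) neighbourhood mask
        pooled = torch.stack([f_cyl[m].max(0).values if m.any()
                              else torch.zeros_like(f_cyl[0]) for m in within])
        pooled = self.proj(pooled)
        # Eq. (9): fuse pooled cylindrical features with the selected cubic features
        return self.fuse(torch.cat([f_fg, pooled], dim=-1)), c_fg, top
```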

3.3 Object-Aware Center-Voxel Transformer

So far, we have obtained refined foreground cubic centroid-voxel features, which we mark as the features of interest to attend to. They incorporate fine-grained features from the more informative cylindrical representation, partially compensating for the loss of object information in cubic voxels. However, because features are aggregated independently, they may lack interaction. Furthermore, incorporating global information into the features is essential for detecting small and distant objects. We therefore propose OCVT, guided by the object center, to effectively capture long-range context at the object level, as shown in Fig. 4.

3D Sparse Heatmap Generation. Leveraging the selected foreground centroids \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f\) from the FCVSQF module, for each annotated bounding box \(B_k\) centered at \((x_k, y_k, z_k)\) we compute the distance between the center of \(B_k\) and each centroid \((\hat{x}_i^k, \hat{y}_i^k, \hat{z}_i^k)\) of \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_f\) that lies inside \(B_k\). A 3D Gaussian kernel is then applied to confine the heatmap response to the range [0, 1]. Formally,

$$\begin{aligned} \hat{H}_i = \exp \left( -\frac{ (x_k - \hat{x}_i^k)^2 + (y_k - \hat{y}_i^k)^2 + (z_k - \hat{z}_i^k)^2 }{2\sigma _k^2}\right) \in [0, 1] \end{aligned}$$
(10)

where \(\sigma _k\) is an object size-adaptive standard deviation [24] and \(\hat{H}_i\) is the heatmap value generated at centroid i. Collecting all centroids yields the final target 3D sparse heatmap \(\hat{\textbf{H}}\), on which a loss is computed against the predicted heatmap \(\textbf{H}\). This allows us to carefully choose the centroid-voxels that are most closely aligned with the object center.
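A sketch of the target heatmap of Eq. (10). For simplicity, every centroid here receives the response of its nearest box rather than being restricted to centroids inside the box, and the size-adaptive \(\sigma _k\) of [24] is passed in as a plain argument:

```python
import torch

def sparse_heatmap_target(centroids: torch.Tensor, box_centers: torch.Tensor,
                          sigmas: torch.Tensor) -> torch.Tensor:
    """Eq. (10): Gaussian response of each centroid w.r.t. every box center, merged by max."""
    d2 = torch.cdist(centroids, box_centers) ** 2          # (n_f, K) squared 3D distances
    h = torch.exp(-d2 / (2 * sigmas.view(1, -1) ** 2))     # per-box Gaussian responses in (0, 1]
    return h.max(dim=1).values                             # target heatmap value per centroid
```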

Fig. 4. The OCVT module. We first generate a 3D sparse heatmap based on the foreground centroid-voxels, then sample center-voxels around the object center according to the heatmap values, and finally perceive long-range object-level context with a transformer encoder.

Center-Voxel Transformer. To build object-level contextual dependencies efficiently, we focus solely on a subset of centroid-voxels closest to the object center. Similar to the Selection step in Fig. 3, the top-K voxels ranked by the predicted heatmap scores are taken as the center-voxels. We denote the center-voxel features as \((\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_{ctr}\) and their centroids as \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_{ctr}\); they are then fed into a Transformer encoder block:

$$\begin{aligned} (\textbf{F}_{\scriptscriptstyle \textrm{cub}}^s)_{ctr} = \mathcal {T}(\textbf{Q}, \textbf{K}, \textbf{V}) \end{aligned}$$
(11)
$$\begin{aligned} \textbf{Q} = \textbf{W}_q (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_{ctr} + \textbf{E}_{pos}, \textbf{K} = \textbf{W}_k (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_{ctr} + \textbf{E}_{pos}, \textbf{V} = \textbf{W}_v (\textbf{F}_{\scriptscriptstyle {{\textrm{cub}}}}^{\mathcal {C}_s})_{ctr} \end{aligned}$$
(12)

where \(\mathcal {T}\) denotes the Transformer, \(\textbf{Q}\), \(\textbf{K}\), \(\textbf{V}\) are the query, key, and value features, and \(\textbf{E}_{pos}\) is the positional embedding obtained by applying a linear layer to \(({\mathbf {\mathcal {C}}}_{\scriptscriptstyle {\textrm{cub}}}^{s})_{ctr}\).
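A minimal sketch of Eqs. (11)-(12) on top of the standard nn.TransformerEncoder. Note that the standard encoder adds the positional embedding to the value path as well, whereas Eq. (12) adds it only to Q and K; the single-layer, four-head setting is our assumption:

```python
import torch
import torch.nn as nn

class CenterVoxelTransformer(nn.Module):
    def __init__(self, c_s: int, n_heads: int = 4, n_layers: int = 1):
        super().__init__()
        self.pos_embed = nn.Linear(3, c_s)     # E_pos: linear map of the center centroids
        layer = nn.TransformerEncoderLayer(d_model=c_s, nhead=n_heads,
                                           dim_feedforward=2 * c_s, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f_ctr: torch.Tensor, c_ctr: torch.Tensor) -> torch.Tensor:
        # f_ctr: (K, c_s) center-voxel features; c_ctr: (K, 3) their centroids
        x = f_ctr + self.pos_embed(c_ctr)                   # add positional embedding (Eq. 12)
        return self.encoder(x.unsqueeze(0)).squeeze(0)      # self-attention over center-voxels (Eq. 11)
```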

Finally, we scatter \((\textbf{F}_{\scriptscriptstyle \textrm{cub}}^s)_{ctr}\) back to the 3D sparse feature map at scale s, resulting in the enhanced cubic voxel features. The final enhanced features combine rich local features from cylindrical voxels with long-range global contextual dependencies around object centers.

3.4 Loss Functions

The overall loss function comprises four components: the foreground loss and heatmap loss from the 3D backbone, the RPN loss, and the RCNN loss (for two-stage models). We follow [3, 4] for the RPN loss \(\mathcal {L}_{\textrm{rpn}}\) and RCNN loss \(\mathcal {L}_{\textrm{rcnn}}\). For the FCVSQF module, the foreground loss \(\mathcal {L}_{\textrm{fore}}\) is a focal loss [26] with BCE. For the OCVT module, the sparse heatmap loss \(\mathcal {L}_{\textrm{hm}}\) is a smooth-L1 loss. The final loss is the weighted sum of the four parts: \(\mathcal {L} = w_1 \mathcal {L}_{\textrm{fore}} + w_2 \mathcal {L}_{\textrm{hm}} + w_3 \mathcal {L}_{\textrm{rpn}} + w_4 \mathcal {L}_{\textrm{rcnn}}\), with all weights set to 1 in our experiments.
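A sketch of how the objective can be assembled; the focal-loss hyperparameters \(\alpha \) and \(\gamma \) are the common defaults from [26] and are assumptions on our part:

```python
import torch
import torch.nn.functional as F

def focal_bce(logits, target, alpha: float = 0.25, gamma: float = 2.0):
    """Foreground loss L_fore: binary focal loss [26] built on BCE."""
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def total_loss(pred_hm, target_hm, l_fore, l_rpn, l_rcnn, w=(1.0, 1.0, 1.0, 1.0)):
    """L = w1*L_fore + w2*L_hm + w3*L_rpn + w4*L_rcnn, with all weights set to 1."""
    l_hm = F.smooth_l1_loss(pred_hm, target_hm)     # sparse heatmap loss L_hm
    return w[0] * l_fore + w[1] * l_hm + w[2] * l_rpn + w[3] * l_rcnn
```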

4 Experiments

4.1 Datasets

KITTI. The KITTI dataset contains 7481 training samples and 7518 testing samples. Typically, the training data are divided into a train set with 3712 samples and a val set with 3769 samples. Average precision (AP) at the easy, moderate, and hard difficulty levels is used as the evaluation metric.

Waymo Open Dataset. The WOD dataset consists of 798 sequences for training and 202 sequences for validation. The evaluation metrics include average precision (AP) and average precision weighted by heading (APH). We report the results on both LEVEL 1 (L1) and LEVEL 2 (L2) difficulty levels.

4.2 Implementation Details

For the cubic voxelization, we follow the PV-RCNN [4] settings (i.e., voxel size and point cloud range) on both datasets. For the cylindrical voxelization on KITTI, the ranges are [0, 80] m, [\(-\pi \)/2, \(\pi \)/2] rad, and [−3, 1] m along the \(\rho \), \(\varphi \), and z axes, respectively, with a voxel size of (0.05 m, \(\pi \)/180 rad, 0.1 m). On WOD, the ranges are [0, 107.84] m, [\(-\pi \), \(\pi \)] rad, and [−2, 4] m along the \(\rho \), \(\varphi \), and z axes, and the voxel size is (\(\varDelta \rho \), \(\pi \)/360 rad, 0.15 m), where \(\varDelta \rho \) varies as described for the DDCV module: the distance bands are [0, 30.24) m, [30.24, 50.24) m, and [50.24, 107.84] m, with \(\varDelta \rho \) set to 0.09 m, 0.10 m, and 0.15 m, respectively.
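For quick reference, the cylindrical voxelization settings above can be collected into a small config block; the numeric values are taken from the text, while the dictionary layout itself is our own:

```python
import math

CYLINDRICAL_VOXELIZATION = {
    'kitti': {
        'rho_range': [0.0, 80.0], 'phi_range': [-math.pi / 2, math.pi / 2], 'z_range': [-3.0, 1.0],
        'voxel_size': (0.05, math.pi / 180, 0.1),      # (d_rho [m], d_phi [rad], d_z [m])
    },
    'waymo': {
        'rho_range': [0.0, 107.84], 'phi_range': [-math.pi, math.pi], 'z_range': [-2.0, 4.0],
        'voxel_size': (None, math.pi / 360, 0.15),     # d_rho is distance-aware (DDCV)
        'rho_bands': [(0.0, 30.24, 0.09),              # (start [m], end [m], d_rho [m]): close
                      (30.24, 50.24, 0.10),            # medium
                      (50.24, 107.84, 0.15)],          # far
    },
}
```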

4.3 Main Results

KITTI. With [2,3,4,5] as our baselines, the experimental results on the val and test sets are presented in Table 1 and Table 2, respectively. Our model exhibits notable improvements in both 3D and BEV mAP (e.g., 1.17%, 1.67%, and 3.02% 3D AP at the Mod. level). Notably, our approach significantly enhances performance for the Ped. and Cyc. categories at the moderate difficulty level while maintaining strong performance on the Car class. The visual comparison in Fig. 5 shows that our method better detects, aligns, and orients objects. Our method also achieves competitive results on the test set, further verifying its effectiveness.

Table 1. Performance comparison of 3D/BEV detection with AP R40 on KITTI val set. \(\dagger \): re-implemented by [27]. \(\ddagger \): reported by [27]. SVF: Sparse Voxel Fusion.
Table 2. Performance comparison with different models on the KITTI test set for Car and Cyclist. The top-2 best performances are highlighted in bold.
Table 3. Performance comparison of 3D detection on WOD val set, training with 20% train set. \(\ddagger \): reported by [27]. SVF: Sparse Voxel Fusion.
Table 4. The detection results on WOD val set, training with full train set. \(\dagger \): re-implemented by [27] with kernel size as 3 in 3D backbone.
Table 5. Effect of each component on the WOD val set with [2] as the baseline, training with 20% of the train set. DRCV: regular cylindrical voxelization in Fig. 2 (left).

WOD. We conduct experiments on the large-scale WOD and report the results in Table 3 and Table 4 on the val set. As shown in Table 3, our method consistently improves performance across all categories, similar to what we observe in KITTI. Notably, our method significantly outperforms baselines on mAPH (L2), with margins of 3.55%, 2.15%, 1.56%, and 2.63%, particularly for small objects. Furthermore, we summarize the comparison between our approach and state-of-the-art methods in Table 4.

4.4 Ablation Study

We conduct ablation studies on each proposed module, as shown in Table 5. The combined effect of the three modules yields a significant gain of 3.6% mAPH (L2) on both the overall and the far range (i.e., 50 m-inf).

DDCV. We observe that distance-aware voxelization (DDCV) outperforms regular voxelization (DRCV). The former shows improvements of 1.5%, 1.5% and 2.1% overall APH (L2) for Veh., Ped., and Cyc., respectively. This confirms the capability of cylindrical voxels to provide richer information and refine object representation.

FCVSQF. It utilizes foreground centroids for feature fusion, preserving original geometric shape information. This optimization helps refine foreground voxel features and expands the receptive field through local query, particularly benefiting sparse, small and distant objects. Notably, it brings a performance gain of 1.5% mAPH (L2) on far range.

OCVT. Guided by object center, it models long-range contextual dependencies at object level with center-voxels, further refining the sparse voxel features. This brings a slight performance gain of 0.7% and 0.9% mAPH (L2) on overall and far range, respectively.

Fig. 5. A visual comparison of SVFNeXt vs. PV-RCNN on the KITTI val set. Ground-truth boxes and detection boxes are shown in different colors.

5 Conclusion

In this paper, we propose SVFNeXt, a plug-and-play fusion-based 3D backbone that can be applied to most voxel-based 3D detectors. Taking a rarely explored direction, we address the limitations of conventional cubic voxels by leveraging cylindrical voxels with a more uniform point distribution, providing richer information for accurate detection. Our centroid-based cross-voxel query and local feature fusion partially alleviate the incomplete object representation of cubic voxels, incorporating fine-grained features and enlarging the receptive field. Moreover, object-level global information learning further refines the feature representations, benefiting the subsequent detection. Extensive experiments on public benchmarks provide compelling evidence of our model's efficacy.

Note that our approach falls slightly short of state-of-the-art methods on larger objects in some cases. Future work will explore model generalization to narrow this gap and further enhance performance.