1 Introduction

In autonomous driving, 3D object detection is an essential task that has received substantial attention from industry [1, 12, 38] and academia [24, 39, 47]. Among existing 3D detection methods, two-stage detectors [25, 42] outperform most single-stage detectors [12, 38] in accuracy thanks to the proposal refinement stage. Previous two-stage methods [7, 14, 17, 24,25,26, 40] have explored different RoI pooling methods to capture better proposal features for refinement. PointRCNN [25] and its subsequent works [14] sample keypoints from the original point cloud near the 3D proposals and extract the features of the sampled points. Part-A\(^2\) Net [26] divides each 3D proposal into regular voxel grids and applies sparse convolutions to capture the features of the voxel grids. PV-RCNN [24] and its variants [7, 17, 35] sample grid points within each 3D proposal and use set abstraction [22] to aggregate the features of the grid points. Methods that utilize sampled keypoints are more flexible than the others, since they directly process raw points and avoid using predefined voxel grids [16, 40] or grid points [6, 31] as intermediaries for RoI feature extraction.

Nevertheless, existing methods that rely on sampled points still have several problems: 1) They ignore that points are unevenly distributed across the different parts of an object and thus adopt sub-optimal sampling strategies. As Fig. 1(a) shows, the points on some parts are too sparse to preserve the structural information, which hinders the prediction of the object's size. 2) Point interrelations are not adequately exploited to model the contextual information of sparse points for object detection. 3) Sparse LiDAR points in a single proposal provide limited semantic cues, so a cluster of points that merely resembles part of an object can easily receive a high classification score. Figure 1(c) shows a wall corner that is wrongly detected as a car. To surmount these challenges, we introduce three modules:

Fig. 1.

Illustration of different sampling strategies: (a) random point sampling (RPS) in existing works and the proposed (b) dynamic farthest voxel sampling (DFVS). And the comparison of results using (c) LiDAR and (d) LiDAR and Image. We show the ground truth in pink bounding boxes and our detected objects in green bounding boxes. (Color figure online)

1) Dynamic Point Aggregation (DPA). To efficiently and effectively group and sample context and object points for 3D proposals, we propose patch search (PS) and dynamic farthest voxel sampling (DFVS). We start with PS, which speeds up grouping, and then move on to DFVS, which addresses the uneven distribution problem during sampling.

Previous methods [5, 25, 26] group points by searching all points to determine whether they belong to a proposal, which is time-consuming since the theoretical time complexity is O(NM), where N and M are the numbers of points and proposals, respectively. This is especially costly on the Waymo Open Dataset [28], where about 180K points and 500 proposals per frame often need to be processed. In contrast, PS only searches the points falling in patches occupied by the proposal to group the corresponding context and object points. Then, DFVS samples keypoints from the grouped points in a way that retains the objects' structure well, as shown in Fig. 1(b). Specifically, instead of sampling points directly [14, 23, 25], DFVS splits each proposal into evenly distributed voxels and iteratively samples the most distant non-empty voxels (i.e., voxels containing at least one point). Further, to ensure both the efficiency and the accuracy of sampling, we resort to a dynamic voxel size: for nearby objects with many points, we use a large voxel size to reduce the sampling complexity, while for distant objects with sparse points, we use a relatively small voxel size to preserve the geometric details.

2) RoI-graph Pooling (RGP). To alleviate missed detections, we utilize a graph neural network (GNN) to build connections among points and better model contextual information through iterative message passing. Compared with previous point-based methods [21, 22], the GNN allows more complex features to be computed along the edges and avoids repeatedly grouping and sampling points. Specifically, RGP constructs local graphs in each 3D proposal, treating the sampled points as graph nodes. To compensate for the information loss caused by downsampling, we use PointNet [21] to encode the neighbor points of each node into its initial features. Then, RGP iteratively aggregates messages from each node's neighbors on a k-NN graph to mine the relations among nodes. Finally, we propose multi-level attentive fusion (MLAF) to capture abundant spatial features from multi-level nodes with different receptive fields and to fully exploit the graph nodes for extracting robust RoI features.

3) Visual Features Augmentation (VFA). Though LiDAR points provide accurate depth information, the lack of sufficient semantic features makes it difficult to distinguish objects with similar geometric structures. Thus, a simple yet effective fusion method is used to fuse geometric features from LiDAR and semantic features from images to suppress false positives, as shown in Fig. 1(d). We decorate the local graphs with image features via bilinear interpolation, since graph nodes serve as a natural bridge between the LiDAR and the image. We train the two streams in an end-to-end manner and show that the complex multi-modality cut-and-paste augmentation [30, 45] is not necessary for our framework.

Based on the three modules, we present our Graph R-CNN that can replace the second stage of other two-stage detectors or supplement any one-stage detector for further improvement. Extensive experiments have been conducted on several detection benchmarks to verify the effectiveness of our approach. We consistently improve existing 3D detectors by a large margin and achieve new state-of-the-art results on both Waymo Open Dataset (WOD) [28] and KITTI [8].

Our contributions can be summarized as follows:

  • We fully consider the uneven distribution of point clouds and propose dynamic point aggregation (DPA).

  • We introduce RoI-graph Pooling (RGP) to capture the robust RoI features by iterative graph-based message passing.

  • We demonstrate a simple yet effective fusion strategy (VFA) to fuse image features with point features during the refinement stage.

  • We present an accurate and efficient 3D object detector (Graph R-CNN) that can be applied to existing 3D detectors. Extensive experiments are conducted to verify the effectiveness of our methods.

2 Related Works

3D Object Detection Using Point Cloud. Current 3D detectors can be mainly divided into two streams: one-stage and two-stage methods. One-stage detectors jointly predict the class and location of objects from projected volumetric grids or downsampled points. SECOND [38] rasterizes the point cloud into 3D voxels and accelerates VoxelNet [50] by exploiting sparse 3D convolutions. PointPillars [12] partitions points into pillars rather than voxels. 3DSSD [39] proposes a fusion farthest point sampling strategy that utilizes both feature and geometry distances for better classification performance.

Two-stage detectors first use a region proposal network (RPN) to generate coarse object proposals and then use a dedicated per-region head to classify and refine them. PointRCNN [25] generates RoIs from the foreground points of the scene and conducts canonical 3D box refinement after point cloud region pooling. PV-RCNN [24] combines the advantages of 3D voxel convolutional neural networks and point-based set abstraction to learn discriminative point cloud features. Voxel R-CNN [7] proposes voxel RoI pooling to extract RoI features directly from voxel features and refine proposals in the second stage. CenterPoint [42] detects object centers using a keypoint detector and refines these estimates using additional point features on the object.

3D Object Detection Using Multi-modality Fusion. Recently, much progress has been made in exploiting the advantages of the camera and LiDAR sensors. MV3D [4] generates 3D proposals from the bird's eye view and fuses multi-view features via region-based representation. EPNet [11] proposes an LI-Fusion module to fuse the deep features of point clouds and camera images in a point-wise paradigm. However, insufficient multi-modality augmentation makes these methods perform only marginally better, or sometimes worse, than approaches that use only the point cloud. Recent works [30, 45] overcome this constraint by extending cut-and-paste augmentation [38] to multi-modality methods, but a complex process is needed to avoid collisions between objects in both the point cloud and the 2D image domain. PointPainting [29] augments LiDAR points with segmentation scores, which are suboptimal for capturing the color and texture present in images.

3D Object Detection Using Graph Neural Networks. Graph neural networks [9] were introduced to model the intrinsic relationships of graph-structured data. Since they are well suited to processing 3D point clouds, several works have adopted GNNs for 3D object detection. 3DVID [41] explores spatial relations among different grid regions by treating the non-empty pillar grids as graph nodes to enhance pillar features. Object DGCNN [33] uses DGCNN [34] to construct a graph between the queries to incorporate neighborhood information into the object detection estimates. DOPS [19] creates a graph in which points are connected to those with similar center predictions to consolidate the per-point object predictions. Point-GNN [27] encodes the point cloud in a fixed-radius near-neighbor graph and predicts the category and shape of the object that each node in the graph belongs to. Our work differs from previous works by constructing local graphs during the refinement stage, which greatly reduces computational and memory overhead, since the k-nearest-neighbor algorithm can be applied in parallel within each 3D proposal and building graphs over numerous background points is avoided.

3 Methods

In this section, we present the design of Graph R-CNN, as shown in Fig. 2. We first introduce dynamic point aggregation in Sect. 3.1. Next, we describe RoI-graph pooling in Sect. 3.2. Then, we illustrate how to incorporate semantic features from the image into our framework in Sect. 3.3. Finally, we define the loss functions in Sect. 3.4.

3.1 Dynamic Point Aggregation

In this section, we present a differentiable dynamic point aggregation (DPA) to efficiently and effectively group and sample points and their features for each proposal. We first enlarge each proposal's size by \(\sigma \) to include enough object and context points. Then, DPA uses patch search (PS) to quickly group the points in each enlarged proposal and dynamic farthest voxel sampling (DFVS) to evenly sample the grouped points.

Patch Search. Unlike previous works [14, 23, 25] that must search all points to determine whether they belong to an enlarged proposal, we divide the entire scene into patches and only search the points falling in the patches occupied by the proposal, as shown in Fig. 3. PS consists of three major steps: 1) We turn the rotated box into an axis-aligned box to make it easier to find the occupied patches. 2) We build point2patch and patch2box index arrays, which store the point and patch indices as keys and the corresponding patch and box indices as values, respectively. 3) We finally group the points for each proposal according to the point2patch and patch2box index arrays, as shown in Fig. 3(a). We note that all the steps can be conducted in parallel on GPUs. In this way, we reduce the theoretical time complexity from O(NM) to O(QK), where Q is the number of points that fall in all occupied patches and K is the predefined maximum number of boxes per patch, since the same patch may be occupied by multiple boxes. Notably, Q and K are much smaller than N and M, respectively.
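To make the two index arrays concrete, the following is a minimal CPU sketch of patch search in PyTorch. The paper builds the point2patch and patch2box arrays in parallel on the GPU; the axis-aligned box bounds, the dictionary-based indexing, and the function name below are our own simplifications.

```python
import torch

def patch_search(points, boxes_xyxy, patch_size=1.0, max_boxes_per_patch=32):
    """Group candidate point indices per proposal via scene patches (CPU sketch).

    points: (N, 3) point coordinates; boxes_xyxy: (M, 4) axis-aligned xy bounds
    (x_min, y_min, x_max, y_max) of the enlarged proposals.
    """
    # point2patch: patch index of every point
    pt_patch = torch.div(points[:, :2], patch_size, rounding_mode="floor").long()

    # patch2box: box indices occupying every patch (capped at K boxes per patch)
    patch2boxes = {}
    for b, (x0, y0, x1, y1) in enumerate(boxes_xyxy.tolist()):
        for px in range(int(x0 // patch_size), int(x1 // patch_size) + 1):
            for py in range(int(y0 // patch_size), int(y1 // patch_size) + 1):
                owners = patch2boxes.setdefault((px, py), [])
                if len(owners) < max_boxes_per_patch:
                    owners.append(b)

    # walk point2patch -> patch2box to collect candidate points for each proposal
    grouped = [[] for _ in range(boxes_xyxy.shape[0])]
    for i, (px, py) in enumerate(pt_patch.tolist()):
        for b in patch2boxes.get((px, py), ()):
            grouped[b].append(i)
    return [torch.tensor(g, dtype=torch.long) for g in grouped]
```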

Fig. 2.

The overall architecture. We take 3D proposals and points from the region proposal network (RPN) and the 2D feature map from the 2D detector as inputs. We propose dynamic point aggregation to sample context and object points and visual features augmentation to decorate the points with 2D features. RoI-graph pooling treats the sampled points as graph nodes to build local graphs for each 3D proposal. We iterate the graph T times to mine the geometric features among the nodes. Finally, each node is fully utilized through graph aggregation to produce robust RoI features.

Dynamic Farthest Voxel Sampling. Since the number of raw points in a box is usually far larger than the number of sampled points (e.g., 70112 vs. 256, as Fig. 4(b) shows), it is nontrivial to ensure that every part of an object is sampled. Therefore, we propose DFVS to balance sampling efficiency and accuracy. Specifically, DFVS partitions proposals into evenly distributed voxels and then iteratively samples the most distant non-empty voxels. Considering that the number of points varies with the distance of the box, as Fig. 4(a) shows, the voxel size should change dynamically with distance to ensure sampling efficiency for nearby objects and accuracy for distant objects, as shown in Fig. 3. Formally, the voxel size \(V_i\) of the box \(b_i\) is calculated by:

$$\begin{aligned} V_i = \lambda \cdot e^{-\frac{\sqrt{x_i^2 + y_i^2 + z_i^2}}{\delta }} \end{aligned}$$
(1)

where \((x_i, y_i, z_i)\) is the i-th box’s center, and \(\lambda \) and \(\delta \) determine the relationship between the voxel size and the distance from the box to the LiDAR sensor.
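As a quick illustration, the per-box voxel size of Eq. (1) can be computed as follows. This is a minimal sketch: the tensor shapes and the function name are assumptions, and \(\lambda =0.18\), \(\delta =50\) are the values reported later in Sect. 4.2.

```python
import torch

def dynamic_voxel_size(box_centers, lam=0.18, delta=50.0):
    """Per-box voxel size of Eq. (1): it shrinks as the box gets farther from the sensor."""
    dist = torch.linalg.norm(box_centers, dim=1)   # (M,) distance of each box center to the LiDAR
    return lam * torch.exp(-dist / delta)          # (M,) voxel sizes V_i
```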

Fig. 3.

Illustration of dynamic point aggregation, which includes (a) patch search and (b) dynamic farthest voxel sampling. In (a), we use different colors to represent different keys and values. In (b), we flatten the voxel grids of each proposal for better display.

Assume we have grouped the points in the box \(b_i\) by patch search and obtained \(\mathcal {P}_i=\{p_j^i=\left[ x_j^i, y_j^i, z_j^i, r_j^i\right] \in \mathbb {R}^4: j = 1,\cdots ,N\}\), where \((x_j^i, y_j^i, z_j^i)\) and \(r_j^i\) indicate j-th point’s coordinate in the i-th box’s canonical coordinate system and the reflectance intensity. We assign each point to evenly divided voxel grids, and the grid index of the point \(p_j^i\) is represented as \(\{g_j^i=(\lfloor \frac{x_j^i}{V_i}\rfloor , \lfloor \frac{y_j^i}{V_i}\rfloor , \lfloor \frac{z_j^i}{V_i}\rfloor )\}\). Each non-empty voxel can be represented by a randomly selected point in the voxel. Next, farthest point sampling (FPS) [22] is applied to iteratively sample the most distant non-empty voxels.

A potential problem with DFVS is that, for a distant box, a small voxel size divides the box into numerous voxels, of which the non-empty voxels occupy only a small fraction. Moreover, to utilize the parallel computation of GPUs, the voxel grids of other boxes need to be padded to the largest grid number, which increases the memory overhead, as shown in Fig. 3(b). Since we only care about non-empty voxels, we use a hash table [18] to record the hashed grid indices of non-empty voxels and quadratic probing to resolve collisions in the hash table.
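The following is a minimal per-proposal sketch of DFVS under these definitions: points are mapped to voxel coordinates, one representative point is kept per non-empty voxel, and FPS is run over the representatives. It uses torch.unique instead of the paper's explicit hash table with quadratic probing, and the helper names are ours.

```python
import torch

def farthest_point_sampling(xyz, k):
    """Plain FPS over a small point set (O(kN), adequate for per-proposal use)."""
    n = xyz.shape[0]
    k = min(k, n)
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    for i in range(1, k):
        dist = torch.minimum(dist, torch.linalg.norm(xyz - xyz[idx[i - 1]], dim=1))
        idx[i] = torch.argmax(dist)
    return idx

def dfvs(points, voxel_size, num_samples=256):
    """Dynamic farthest voxel sampling for one proposal (sketch).

    points: (N, 3) canonical coordinates of the grouped points; voxel_size: V_i from Eq. (1).
    Returns indices into `points` of the sampled keypoints.
    """
    grid = torch.floor(points / voxel_size).long()               # per-point voxel coordinates
    _, inverse = torch.unique(grid, dim=0, return_inverse=True)  # only non-empty voxels appear
    sorted_inv, order = torch.sort(inverse)
    is_first = torch.ones_like(sorted_inv, dtype=torch.bool)
    is_first[1:] = sorted_inv[1:] != sorted_inv[:-1]
    reps = order[is_first]                                       # one representative point per voxel
    # FPS over the representatives = iteratively taking the most distant non-empty voxel
    keep = farthest_point_sampling(points[reps], num_samples)
    return reps[keep]
```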

Fig. 4.

The statistical plot of the (a) average and (b) maximum number of points in each ground truth on Waymo Open Dataset for vehicle and pedestrian.

3.2 RoI-Graph Pooling

In this section, we describe the process of RoI-graph pooling, as shown in Fig. 2, which treats sampled points as nodes to build local graphs in 3D proposals. It consists of graph construction, iteration, and aggregation.

Graph Construction. Given sampled points \(\overline{\mathcal {P}}=\{p_j=\left[ x_j, y_j, z_j, r_j\right] \in \mathbb {R}^4: j = 1,\cdots ,T\}\) for each proposal b (we drop the subscript i for ease of notation), we construct a local graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E})\), where node \(v_j\in \mathcal {V}\) represents a sampled point \(p_j\in \overline{\mathcal {P}}\), and edge \(e_j^k\in \mathcal {E}\) indicates the connection between nodes \(v_j\) and \(v_k\). To reduce the computational overhead, we define \(\mathcal {G}\) as a k-nearest-neighbor (k-NN) graph built from the geometric distances among the nodes. Although efficient, building graphs on downsampled points inevitably loses fine-grained features. Thus, we use PointNet [21] to encode the original neighbor points within a radius r of each node. We note that the neighbor query induces only a marginal computational overhead because it is conducted only within each proposal.
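A minimal sketch of the k-NN graph construction over one proposal's nodes is given below; the paper only specifies a k-NN graph built from geometric distances, so the use of torch.cdist and the function name are our own choices.

```python
import torch

def knn_graph(nodes_xyz, k=8):
    """Return the (T, k) nearest-neighbor indices for the sampled nodes of one proposal."""
    d = torch.cdist(nodes_xyz, nodes_xyz)          # (T, T) pairwise Euclidean distances
    d.fill_diagonal_(float("inf"))                 # exclude self-loops
    return d.topk(k, largest=False).indices        # (T, k) neighbor indices
```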

The same graph nodes may be wrapped by different proposals, which results in identical pooled features and thus introduces ambiguity in the refinement stage [14, 26]. Inspired by [23], we append the 3D proposal's local corners (i.e., the corners transformed to the proposal's canonical coordinate system) to each node to make the nodes discriminative. In our experiments, we found that two diagonal corners are sufficient. Formally, the initial state \(s_j^0\) of each node \(v_j\) at iteration step \(t=0\) is represented by:

$$\begin{aligned} s_j^0 = \left[ x_j, y_j, z_j, r_j, f_j, u_j, w_j\right] , \end{aligned}$$
(2)

where \(\left[ \cdot ,\cdot \right] \) denotes the concatenation operation, \(f_j\) is the features from PointNet, and \(u_j\) and \(w_j\) are two diagonal corners of the 3D proposal.

Fig. 5.

Illustration of multi-level attentive fusion.

Graph Iteration. To mine the rich geometric relations among nodes, we iteratively pass the message on \(\mathcal {G}\) and update the node’s state at each iteration step. Concretely, at step t, a node \(v_j\) aggregates information from all the neighbor nodes \(v_k \in \mathcal {N}_{v_j}\) in the k-NN graph. Following [2, 33, 41], we use EdgeConv [34] to update the state \(s_j^{t+1}\):

$$\begin{aligned} s_j^{t+1} = \max _{v_k \in \mathcal {N}_{v_j}}{\phi _\theta ([s_k^t-s_j^t, s_j^t])}, \end{aligned}$$
(3)

where \(\phi _\theta \) is parameterized by a Multilayer Perceptron (MLP).
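For clarity, Eq. (3) corresponds to a single EdgeConv message-passing layer, sketched below; the module name and the one-layer MLP are our simplifications of the parameterization \(\phi _\theta \).

```python
import torch
import torch.nn as nn

class EdgeConvLayer(nn.Module):
    """One message-passing step of Eq. (3): s_j^{t+1} = max_k phi([s_k - s_j, s_j])."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU(inplace=True))

    def forward(self, states, neighbor_idx):
        # states: (T, C) node states; neighbor_idx: (T, k) indices from the k-NN graph
        neighbors = states[neighbor_idx]                             # (T, k, C)
        center = states.unsqueeze(1).expand_as(neighbors)            # (T, k, C)
        edge_feat = torch.cat([neighbors - center, center], dim=-1)  # (T, k, 2C)
        return self.mlp(edge_feat).max(dim=1).values                 # (T, out_dim) max aggregation
```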

Graph Aggregation. To capture robust RoI features, we propose multi-level attentive fusion (MLAF) to fuse the nodes' features, as shown in Fig. 5. Specifically, we concatenate the nodes' features \([s_j^1, \cdots , s_j^T]\) from different iterations and feed them into several MLPs to learn channel-wise weights. Then, we reweight \([s_j^1, \cdots , s_j^T]\) to enhance the features for final detection. After that, to fully utilize every graph node for proposal refinement, we explore several aggregation operations, e.g., the channel-wise Transformer [23], Set Transformer [13], attention sum, average pooling, and max pooling. Our final model uses max pooling as it provides the best empirical performance. Dropout is used in the later MLPs to reduce overfitting, and a shortcut connection is added to fuse more features without adding much cost.
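A rough sketch of MLAF under the description above, assuming a sigmoid-gated channel reweighting, max pooling over nodes, and a linear shortcut; the exact layer layout is not specified in the text, so treat this as one plausible instantiation rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MLAF(nn.Module):
    """Multi-level attentive fusion (sketch): reweight concatenated multi-level node
    features channel-wise, max-pool the nodes into one RoI feature, add a shortcut."""

    def __init__(self, in_dim, embed_dim=256, dropout=0.1):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(in_dim, in_dim), nn.Sigmoid())  # channel weights
        self.proj = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(inplace=True),
                                  nn.Dropout(dropout), nn.Linear(embed_dim, embed_dim))
        self.shortcut = nn.Linear(in_dim, embed_dim)

    def forward(self, multi_level_states):
        # multi_level_states: list of (num_nodes, C_t) node features from the T iterations
        x = torch.cat(multi_level_states, dim=-1)         # (num_nodes, sum C_t)
        x = x * self.attn(x)                              # channel-wise reweighting
        pooled = x.max(dim=0).values                      # (sum C_t,) RoI feature via max pooling
        return self.proj(pooled) + self.shortcut(pooled)  # fused RoI feature with shortcut
```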

3.3 Visual Features Augmentation

Cut-and-paste augmentation (CPA) [38] is widely used in 3D object detection to increase the training samples, which speeds up training convergence and improves detection performance. Since our Graph R-CNN extracts RoI features directly from raw point clouds, we question whether CPA is required for our model. We carefully study its influence in Sect. 4.4 and find that our model does not depend on it. Thus, CPA is only used to pretrain the RPN and is disabled when training the whole framework.

For the camera image, we extract high-level semantic features using a pretrained 2D detector. Then, we apply two \(1\times 1\) convolutional kernels to reduce the dimensionality of the output features, as Fig. 2 shows. The benefit is twofold. First, it learns to select the features that contribute most to the refinement performance. Second, it eases optimization when fusing low-dimensional point features with high-dimensional image features. Then, we project each graph node onto the camera image and collect the feature vector at that pixel through bilinear interpolation. Finally, these features are appended to \(s_j^0\) for each node \(v_j\).
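A minimal sketch of this projection-and-sampling step, assuming a \(4\times 4\) LiDAR-to-image projection matrix and using F.grid_sample for the bilinear interpolation; the function name and argument layout are ours.

```python
import torch
import torch.nn.functional as F

def sample_image_features(feat_map, nodes_xyz, lidar2img, img_hw):
    """Bilinearly sample reduced 2D features at the projected graph nodes (sketch).

    feat_map: (1, C, H, W) reduced 2D feature map; nodes_xyz: (T, 3) LiDAR coordinates;
    lidar2img: (4, 4) projection matrix; img_hw: (img_h, img_w) of the input image.
    """
    ones = torch.ones_like(nodes_xyz[:, :1])
    pts = torch.cat([nodes_xyz, ones], dim=1) @ lidar2img.T        # (T, 4) homogeneous image coords
    uv = pts[:, :2] / pts[:, 2:3].clamp(min=1e-5)                  # (T, 2) pixel coordinates
    # normalize pixel coordinates to [-1, 1] for grid_sample
    norm = torch.stack([uv[:, 0] / img_hw[1], uv[:, 1] / img_hw[0]], dim=-1) * 2 - 1
    grid = norm.view(1, 1, -1, 2)                                  # (1, 1, T, 2)
    feats = F.grid_sample(feat_map, grid, mode="bilinear", align_corners=False)
    return feats.squeeze(0).squeeze(1).T                           # (T, C) per-node image features
```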

3.4 Loss Functions

Classification Loss. For class-agnostic confidence score prediction, we follow [24, 42] to use a score target \(I_i\) guided by the box’s 3D IoU with the corresponding ground-truth bounding box:

$$\begin{aligned} I_i = \min \left( 1, \max \left( 0, 2\times \textrm{IoU}_{i} - 0.5\right) \right) , \end{aligned}$$
(4)

where \(\textrm{IoU}_i\) is the IoU between the i-th proposal box and the ground truth. The training is supervised with a binary cross entropy loss:

$$\begin{aligned} \mathcal {L}_{cls}=\frac{1}{B} \sum _{i=1}^{B} -I_{i} \log \left( \hat{I}_{i}\right) -\left( 1-I_{i}\right) \log \left( 1-\hat{I}_{i}\right) , \end{aligned}$$
(5)

where \(\hat{I}_i\) is the predicted confidence score, and B is the number of sampled region proposals at the training stage.
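Eqs. (4) and (5) together amount to an IoU-guided binary cross-entropy, sketched below; we use the logits form of BCE, which is numerically equivalent to applying a sigmoid followed by Eq. (5), and the function name is ours.

```python
import torch
import torch.nn.functional as F

def iou_guided_cls_loss(pred_logits, ious):
    """IoU-guided confidence target (Eq. 4) with binary cross-entropy (Eq. 5)."""
    target = (2.0 * ious - 0.5).clamp(min=0.0, max=1.0)             # Eq. (4)
    return F.binary_cross_entropy_with_logits(pred_logits, target)  # mean over the B proposals
```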

Regression Loss. For box prediction, we transform the 3D proposal \(b_i = (x_i, y_i, z_i,\) \( l_i, w_i, h_i, \theta _i)\) and the corresponding 3D ground-truth bounding box \(b_i^{gt} = (x_i^{gt}, y_i^{gt},\) \( z_i^{gt}, l_i^{gt}, w_i^{gt}, h_i^{gt}, \theta _i^{gt})\) from the global reference frame to the canonical coordinate system of 3D proposal:

$$\begin{aligned} \begin{aligned}&\tilde{b}_{i}=\left( 0,0,0, l_{i}, w_{i}, h_{i}, 0\right) , \\&\tilde{b}_{i}^{gt}=\left( x_{i}^{gt}-x_{i}, y_{i}^{gt}-y_{i}, z_{i}^{gt}-z_{i}, l_{i}^{gt}, w_{i}^{gt}, h_{i}^{gt}, \varDelta \theta _{i}\right) , \end{aligned} \end{aligned}$$
(6)

where \(\varDelta \theta _{i}=\theta _{i}^{gt}-\theta _{i}\). Then, the regression targets for center \(t_{i}^{c}\), size \(t_{i}^{s}\), and orientation \(t_{i}^{o}\) can be defined as:

$$\begin{aligned} \begin{aligned}&t_{i}^{c}=\left( x_{i}^{gt}-x_{i}, y_{i}^{g t}-y_{i}, z_{i}^{g t}-z_{i}\right) , \\&t_{i}^{s}=\left( l_{i}^{gt}-l_{i}, w_{i}^{gt}-w_{i}, h_{i}^{gt}-h_{i}\right) , \\&t_{i}^{o}=\varDelta \theta _{i} - \lfloor \frac{\varDelta \theta _{i}}{\pi } + 0.5\rfloor \times \pi . \end{aligned} \end{aligned}$$
(7)

Having all the targets \(t_i=(t_i^c,t_i^s,t_i^o)\), our regression loss is defined as:

$$\begin{aligned} \mathcal {L}_{reg}=\frac{1}{B_{+}} \sum _{i=1}^{B_{+}} {\text {L1}}\left( o_{i}-t_{i}\right) , \end{aligned}$$
(8)

where \(o_i\) is the output of the model’s regression branch, and \(B_{+}\) is the number of positive samples.
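Under these definitions, the regression targets and loss of Eqs. (6)-(8) can be sketched as follows. For brevity the center residuals are kept in the global axes rather than rotated into each proposal's canonical frame, and the helper names are ours.

```python
import math
import torch
import torch.nn.functional as F

def regression_targets(proposals, gts):
    """Targets of Eq. (7) for boxes given as (x, y, z, l, w, h, theta)."""
    t_c = gts[:, :3] - proposals[:, :3]                             # center residuals
    t_s = gts[:, 3:6] - proposals[:, 3:6]                           # size residuals
    d_theta = gts[:, 6] - proposals[:, 6]
    t_o = d_theta - torch.floor(d_theta / math.pi + 0.5) * math.pi  # wrap into [-pi/2, pi/2)
    return torch.cat([t_c, t_s, t_o.unsqueeze(-1)], dim=-1)         # (B+, 7)

def regression_loss(pred, proposals, gts):
    """Eq. (8): L1 loss averaged over the positive proposals."""
    return F.l1_loss(pred, regression_targets(proposals, gts))
```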

Total Loss. Finally, the overall loss is formulated as:

$$\begin{aligned} \mathcal {L}=\mathcal {L}_{cls}+\alpha \mathcal {L}_{reg}, \end{aligned}$$
(9)

where \(\alpha \) is a hyperparameter to balance the loss, which is 1 by default.

4 Experiments

4.1 Datasets

Waymo Open Dataset is a large-scale autonomous driving dataset consisting of 798 scenes for training and 202 scenes for validation. The evaluation protocol consists of average precision (AP) and average precision weighted by heading (APH). It includes two difficulty levels: LEVEL_1 denotes objects containing more than 5 points, and LEVEL_2 denotes objects containing at least 1 point.

KITTI contains 7481 training samples and 7518 testing samples in autonomous driving scenes. We follow [4, 7, 24] to divide the training data into a train set with 3712 samples and a val set with 3769 samples. The performance on the val set and the test leaderboard are reported.

4.2 Implementation Settings

Implementation Details. The codebase of CenterPoint is used for WOD. Then, we replace the second stage of CenterPoint-Voxel with our method (i.e., Graph-Ce) and train the network separately. For dynamic point aggregation, we sample 256 points for each proposal and set \(\sigma =0.4\). In dynamic farthest voxel sampling, we set the hash table size to 4099 and \(\lambda \) and \(\delta \) to 0.18 and 50, respectively. In patch search, K and the patch size are set to 32 and 1.0, respectively. For RoI-graph pooling, we set \(r=0.4\) and the embedding channels to [16, 16] in PointNet. We update the graph with \(T=3\) iterations, and the output dimensions of the three iterations are [32, 32, 64]. The number k of nearest neighbors is set to 8. In MLAF, the embedding dimension is 256, and the dropout ratio is 0.1.
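For reference, the WOD hyperparameters above can be collected into a single configuration dictionary; the grouping and key names below are hypothetical and not taken from the released code.

```python
# Hypothetical configuration summarizing the WOD hyperparameters above (names are ours).
GRAPH_RCNN_WOD_CFG = {
    "dpa": {"num_sampled_points": 256, "sigma": 0.4,
            "dfvs": {"hash_size": 4099, "lambda": 0.18, "delta": 50},
            "patch_search": {"max_boxes_per_patch": 32, "patch_size": 1.0}},
    "rgp": {"pointnet": {"radius": 0.4, "channels": [16, 16]},
            "graph": {"iterations": 3, "out_dims": [32, 32, 64], "k": 8},
            "mlaf": {"embed_dim": 256, "dropout": 0.1}},
}
```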

For KITTI, the codebase of OpenPCDet is used. We propose Graph-Pi, Graph-Vo, and Graph-Po, which use the pillar-based PointPillars, the voxel-based SECOND, and the point-based 3DSSD as their region proposal networks, respectively. We incorporate the image branch into Graph-Vo (i.e., Graph-VoI) to compare with previous multi-modality methods. For the 2D detector, we use CenterNet [48] with a DLA-34 [44] backbone, which takes images with a resolution of \(1280\times 384\) as input. The dimension of the output features is reduced to 32 by the feature reduction layer.

Training Details. For WOD, we use the same training schedules and assignment strategies as CenterPoint-Voxel. The second stage is trained for 6 epochs on 4 GTX 1080Ti GPUs with a batch size of 8 per GPU.

For KITTI, we use the same training configuration as PV-RCNN and train the whole model end-to-end for 80 epochs on 4 GTX 1080Ti GPUs with a batch size of 4 per GPU; the pretrained RPN and 2D detector are frozen during training. For the 2D detector, we pretrain CenterNet on WOD for 24 epochs and finetune it on KITTI for 12 epochs, using the Adam optimizer with a one-cycle policy, a batch size of 2, and a learning rate of 0.00025.

Table 1. Vehicle detection results on WOD validation sequences. CenterPoint-Voxel\(^\dag \) is reproduced by us based on the officially released code. CenterPoint-Voxel\(^\ddag \) is the first stage of CenterPoint-Voxel\(^\dag \).
Table 2. Vehicle, pedestrian, and cyclist results on WOD validation sequences.

4.3 Comparison with State-of-the-Art Methods

Waymo Open Dataset. We compare Graph-Ce for the vehicle class at different distances on the full WOD validation with previous methods. Table 1 shows that Graph-Ce achieves the state-of-the-art results in both level 1 and level 2 among all the published papers with a single frame LiDAR input. In Table 2, we present our results for the vehicle, pedestrian, and cyclist classes. Compared with our baseline, i.e., CenterPoint-Voxel\(^\ddag \), our method improves the 3D APH in level 2 for the vehicle, pedestrian, and cyclist by 5.93%, 6.35%, and 2.90%, respectively.

KITTI. We compare Graph-Pi, Graph-Vo, Graph-Po, and Graph-VoI with the previous methods listed in Table 3. Graph-Pi achieves the fastest inference speed among all two-stage methods. Compared with methods using only LiDAR as input, Graph-Po ranks 1\(^{st}\) in 3D AP and BEV AP with competitive inference speed. Graph-VoI outperforms all previous multi-modality methods by a large margin (+2.08% for 3D easy AP and +2.6% for 3D moderate AP). Table 4 shows that our method consistently improves PointPillars, SECOND, and 3DSSD by a large margin, demonstrating its efficacy.

Table 3. Performance comparison on the KITTI test server for 3D car detection. L and I represent the LiDAR point cloud and the camera image, respectively.
Table 4. Performance of our model on the KITTI val set with AP calculated by 40 recall positions for car class. \(^\dag \) indicates our reproduced results.
Table 5. Ablation study of every module: RoI-graph pooling (RGP), multi-level attentive fusion (MLAF), dynamic point aggregation (DPA), PointNet (PN), and diagonal corners (DC). We show the 3D APH at level 2 on the WOD validation set.
Table 6. Ablation study of the performance of DPA at different distances. We show the level 2 APH on the WOD validation set for vehicle class.
Table 7. Ablation study of FPS and DFVS based on 3DSSD. We report the mAP on KITTI val set for car class and the runtime of sampling.
Table 8. Ablation study of different sampling and searching strategies. \(^\dag \) is tested by us based on the officially released code.

4.4 Ablation Study

Analysis of the Dynamic Point Aggregation. The third and fourth rows in Table 5 show that dynamic point aggregation (DPA) contributes an improvement of 1.06% and 0.87% APH at level 2 for vehicle and pedestrian, respectively. In particular, Table 6 shows that DPA improves the baseline by 1.49% APH at 0-30 m, since nearby objects suffer more from the uneven distribution problem. The first and second rows in Table 8 show that patch search (PS) is 35\(\times \) faster than the baseline, i.e., point cloud region pooling (PR) [25].

We also explore several sampling strategies in Table 8 to solve the uneven distribution problem, i.e., voxel sampling (VS), dynamic voxel sampling (DVS), dynamic farthest voxel sampling (DFVS), and farthest point sampling (FPS). We note that VS is a special case of DVS when \(\delta \) is large, and FPS is a special case of DFVS when \(\lambda \) is small. Table 8 shows that DFVS achieves the best trade-off between accuracy and efficiency. Further, we explore the use of DFVS in point-based 3D object detectors as an alternative to FPS. Table 7 shows that DFVS achieves results similar to FPS at a lower runtime.

Analysis of the RoI-Graph Pooling. The first and second rows in Table 5 show that the RoI-graph pooling (RGP) raises the APH at level 2 for vehicle and pedestrian by 3.31% and 4.12%, respectively. In Table 9, we study the effect of the number of iterations on the detection accuracy, where the number of neighbors is set to 8 by default to save GPU memory. This result suggests it is beneficial to iterate more times to mine geometric relations. Besides, we find that the accuracy gains for distant objects are greater than for nearby objects, since contextual information is better modeled to alleviate the missed detection of distant objects. For graph construction, we investigate the components used in the initial state of the graph node. The fourth and fifth rows in Table 5 show that using PointNet (PN) raises the APH by 0.21% for the vehicle class, since downsampling introduces the loss of fine-grained details, and the fifth and sixth rows show that adding two diagonal corners (DC) for each node raises the APH by 1.06% for both vehicle and pedestrian. For graph aggregation, the second and third rows in Table 5 show that multi-level attentive fusion (MLAF) boosts the performance by 0.29% APH for the vehicle class. Furthermore, we study the influence of different aggregation methods in Table 10, i.e., channel-wise Transformer (CT), Set Transformer (ST), attention sum (AS), average pooling (AP), and max pooling (MP). The Transformers do not achieve better results, probably because their larger number of parameters leads to overfitting.

Table 9. Ablation study of the number of iterations to update the graph.
Table 10. Ablation study of different methods to aggregate nodes’ features.
Table 11. Ablation study of image features and CPA. \(^\dag \) and \(^\ddag \) indicate CPA is used in RPN and refinement, respectively.

Analysis of the Visual Features Augmentation. Comparing the fourth and fifth rows in Table 4, we observe that the image features raise the detection results by 2.34%, 0.75%, and 0.8% 3D AP for easy, moderate, and hard, respectively. We carefully analyze the effect of cut-and-paste augmentation (CPA) at different stages in Table 11. We find that the refinement stage is hardly affected by CPA. Thus, we can conveniently train the LiDAR branch and the image branch end-to-end without the help of CPA. We also study different image features, i.e., the RGB values of the input image, the segmentation scores, and the output features of the 2D detector, and find that using the 2D features achieves the best results.

5 Conclusions

We present an accurate and efficient 3D object detector Graph R-CNN that can be applied to existing 3D detectors. Our framework can handle the unevenly distributed and sparse point clouds by utilizing the dynamic point aggregation and the semantic-decorated local graph.