Abstract
Two-stage detectors have gained much popularity in 3D object detection. Most two-stage 3D detectors utilize grid points, voxel grids, or sampled keypoints for RoI feature extraction in the second stage. Such methods, however, are inefficient in handling unevenly distributed and sparse outdoor points. This paper solves this problem in three aspects. 1) Dynamic Point Aggregation. We propose the patch search to quickly search points in a local region for each 3D proposal. The dynamic farthest voxel sampling is then applied to evenly sample the points. Especially, the voxel size varies along the distance to accommodate the uneven distribution of points. 2) RoI-graph Pooling. We build local graphs on the sampled points to better model contextual information and mine point relations through iterative message passing. 3) Visual Features Augmentation. We introduce a simple yet effective fusion strategy to compensate for sparse LiDAR points with limited semantic cues. Based on these modules, we construct our Graph R-CNN as the second stage, which can be applied to existing one-stage detectors to consistently improve the detection performance. Extensive experiments show that Graph R-CNN outperforms the state-of-the-art 3D detection models by a large margin on both the KITTI and Waymo Open Dataset. And we rank first place on the KITTI BEV car detection leaderboard.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
In autonomous driving, 3D object detection is an essential task that has received substantial attention from industry [1, 12, 38] and academia [24, 39, 47]. Among the existing 3D detection methods, two-stage detectors [25, 42] outperform most single-stage detectors [12, 38] in accuracy due to the proposal refinement stage. Previous two-stage methods [7, 14, 17, 24,25,26, 40] have explored different RoI pooling methods to capture better proposal features for refinement. PointRCNN [25] and its subsequent works [14] sample keypoints from original point clouds near the 3D proposal and extract the features of the sampled points. Part-A\(^2\) Net [26] divides each 3D proposal into regular voxel grids and applies sparse convolutions to capture the features of voxel grids. PV-RCNN [24] and its variants [7, 17, 35] sample grid points within each 3D proposal and use the set abstraction [22] to aggregate the features of grid points. The methods that utilize the sampled keypoints show more flexibility than others since they directly process raw points and avoid the predefined voxel grids [16, 40] or grid points [6, 31] as intermediaries for RoI feature extraction.
Nevertheless, existing methods relied on sampled points still have some problems: 1) They ignore that points are unevenly distributed in different parts of an object, thus yield sub-optimal sampling strategy. As Fig. 1(a) shows, points for some parts are too sparse to preserve the structure information, which will hinder the prediction of the object’s size. 2) The point interrelation is not adequately utilized to model the contextual information of sparse points for object detection. 3) Sparse LiDAR points in a single proposal provide limited semantic cues, which easily leads to a series of points that resemble a part of an object yielding high classification scores. Figure 1(c) shows that a wall corner is wrongly detected as a car. To surmount the above challenges, we introduce three modules:
1) Dynamic Point Aggregation (DPA). To efficiently and effectively group and sample context and object points for 3D proposals, we propose patch search (PS) and dynamic farthest voxel sampling (DFVS). We will start with PS to speed up grouping then move on to DFVS to solve the uneven distribution problem during sampling.
Previous methods [5, 25, 26] group points by searching all points to determine whether they belong to a proposal, which is time-consuming since the theoretical time complexity is O(NM), where N and M are the number of points and proposals, respectively. Especially for the detection on Waymo Open Dataset [28], there are often about 180K points and 500 proposals per frame that need to be processed. In contrast, PS only searches the points falling in patches occupied by the proposal to group corresponding context and object points. Then, DFVS sample keypoints from the grouped points to well retain the objects’ structure, as shown in Fig. 1(b). Specifically, instead of sampling points directly [14, 23, 25], DFVS splits each proposal into evenly distributed voxels and iteratively samples the most distant non-empty voxels (i.e., voxels involving at least one point). Further, to ensure the efficiency and accuracy of sampling, we resort to dynamic voxel size. That is, for nearby objects with many points, we use a large voxel size to reduce the sampling complexity. While for distant objects with sparse points, we use a relatively small voxel size to preserve the geometric details.
2) RoI-graph Pooling (RGP). To alleviate the missing detection, we utilize the graph neural network (GNN) to build connections among points to better model contextual information through iterative message passing. Compared with previous point-based methods [21, 22], the GNN allows more complex features to be determined along the edges and avoids grouping and sampling the points repeatedly. Specifically, RGP constructs local graphs in each 3D proposal, which treats the sampled points as graph nodes. To compensate for the information loss caused by downsampling, we use PointNet [21] to encode neighbor points of each node into the initial features. Then, RGP iteratively aggregates messages from its neighbors on a k-NN graph to mine the relations among nodes. Finally, we propose multi-level attentive fusion (MLAF) to capture abundant spatial features from multi-level nodes with different receptive fields and fully exploit graph nodes to extract robust RoI features.
3) Visual Features Augmentation (VFA). Though LiDAR points provide accurate depth information, the lack of sufficient semantic features makes it difficult to distinguish objects with similar geometric structures. Thus, a simple yet effective fusion method is used to fuse geometric features from LiDAR and semantic features from images for suppressing the false positives, as shown in Fig. 1(d). We decorate local graphs with image features by bilinear interpolation since graph nodes serve as a natural bridge between the LiDAR and image. We train the two streams in an end-to-end manner and show the complex multi-modality cut-and-paste augmentation [30, 45] is not necessary for our framework.
Based on the three modules, we present our Graph R-CNN that can replace the second stage of other two-stage detectors or supplement any one-stage detector for further improvement. Extensive experiments have been conducted on several detection benchmarks to verify the effectiveness of our approach. We consistently improve existing 3D detectors by a large margin and achieve new state-of-the-art results on both Waymo Open Dataset (WOD) [28] and KITTI [8].
Our contributions can be summarized as follows:
-
We fully consider the uneven distribution of point clouds and propose dynamic point aggregation (DPA).
-
We introduce RoI-graph Pooling (RGP) to capture the robust RoI features by iterative graph-based message passing.
-
We demonstrate a simple yet effective fusion strategy (VFA) to fuse image features with point features during the refinement stage.
-
We present an accurate and efficient 3D object detector (Graph R-CNN) that can be applied to existing 3D detectors. Extensive experiments are conducted to verify the effectiveness of our methods.
2 Related Works
3D Object Detection Using Point Cloud. Current 3D detectors can be mainly divided into two streams: one-stage and two-stage methods. One-stage detectors jointly predict an output class and location of objects at the projected volumetric grids or downsampled points. SECOND [38] rasterizes point cloud into 3D voxels and accelerates VoxelNet [50] by exploiting sparse 3D convolution. PointPillars [12] partitions points into pillars rather than voxels. 3DSSD [39] proposes a fusion farthest point sampling strategy by utilizing both feature and geometry distance for better classification performance.
Two-stage detectors first use a region proposal network (RPN) to generate coarse object proposals and then use a dedicated per-region head to classify and refine them. PointRCNN [25] generates RoIs based on foreground points from the scene and conducts canonical 3D box refinement after point cloud region pooling. PV-RCNN [24] incorporates the advantage from 3D voxel Convolutional Neural Network and Point-based set abstraction to learn discriminative point cloud features. Voxel R-CNN [7] proposes a voxel RoI pooling to extract RoI features directly from voxel features to refine proposals in the second stage. CenterPoint [42] detects centers of objects using a keypoint detector and refines these estimates using additional point features on the object.
3D Object Detection Using Multi-modality Fusion. Recently, much progress has been made to exploit the advantages of the camera and LiDAR sensors. MV3D [4] generates 3D proposals from the bird’s eye view and fuses multi-view features via region-based representation. EPNet [11] proposes LI-Fusion module to fuse the deep features of point clouds and camera images in a point-wise paradigm. However, insufficient multi-modality augmentation makes these methods perform only marginally better or sometimes worse than approaches that only use point cloud. Recent works [30, 45] overcome the constraint by extending the cut-and-paste augmentation [38] to multi-modality methods. But a complex process is needed to avoid collisions between objects in both point cloud and 2D imagery domain. PointPainting [29] augments LiDAR points with segmentation scores, which are suboptimal to cover color and textures in images.
3D Object Detection Using Graph Neural Networks. Graph Neural Networks [9] are introduced to model intrinsic relationships of graph-structured data. Since they are suitable for processing 3D point clouds, some works have adopted GNNs for 3D object detection. 3DVID [41] explores spatial relations among different grid regions by treating the non-empty pillar grids as graph nodes to enhance pillar features. Object DGCNN [33] uses DGCNN [34] to construct a graph between the queries for incorporating neighborhood information in object detection estimates. DOPS [19] creates a graph where the points are connected to those with similar center predictions for consolidating the per-point object predictions. Point-GNN [27] encodes point clouds in a fixed radius near-neighbors graph and predicts the category and shape of the object that each node in the graph belongs to. Our work differs from previous works by constructing local graphs during the refinement stage, which greatly saves computational and memory overhead since the k-nearest neighbor algorithm can be parallelly applied in each 3D proposal, and numerous background points can be avoided to build graphs.
3 Methods
In this section, we present the design of Graph R-CNN, as shown in Fig. 2. We first introduce dynamic point aggregation in Sect. 3.1. Next, we will demonstrate RoI-graph pooling in Sect. 3.2. Then, we will illustrate how to incorporate semantic features from the image into our framework in Sect. 3.3. Finally, we will show the definition of the loss function in Sect. 3.4.
3.1 Dynamic Point Aggregation
In this section, we present a differentiable dynamic point aggregation (DPA) to efficiently and effectively group and sample points and their features for each proposal. We first enlarge each proposal’s size by \(\sigma \) to wrap enough object and context points. Then, DPA uses patch search (PS) to quickly group the points in each enlarged proposal and dynamic farthest voxel sampling (DFVS) to evenly sample the grouped points.
Patch Search. Unlike previous works [14, 23, 25] that need to search all points to determine whether they belong to an enlarged proposal, we divide the entire scene into patches and only search the points falling in patches occupied by the proposal, as shown in Fig. 3. PS consists of three major steps: 1) We turn the rotated box into an axis-aligned box to make it easier to find the occupied patches. 2) We build point2patch and patch2box index arrays, which store the point and patch indices as keys, and the corresponding patch and box indices as values, respectively. 3) We finally group the points for each proposal according to the point2patch and patch2box index arrays, as shown in Fig. 3(a). We note that all the steps can be conducted in parallel on GPUs. In this way, we reduce the theoretical time complexity from O(NM) to O(QK), where Q is the number of points that fall in all occupied patches, and K is the predefined maximum number of boxes per patch since the same patch may be occupied by multiple boxes. Notably, Q and K are much smaller than N and M, respectively.
Dynamic Farthest Voxel Sampling. Since the number of raw points in a box is usually far more than that of the sampled points (e.g., 70112 vs. 256, as Fig. 4(b) shows), it’s nontrivial to ensure every part of an object is sampled. Therefore, we propose DFVS to balance sampling efficiency and accuracy. To be specific, DFVS partitions proposals into evenly distributed voxels and then iteratively sample the most distant non-empty voxels. Considering the number of points varies with the distance of the box, as Fig. 4(a) shows, the voxel size should be changed dynamically according to the distance to ensure the sampling efficiency of nearby objects and accuracy of distant objects, as shown in Fig. 3. Formally, the voxel size \(V_i\) of the box \(b_i\) can be calculated by:
where \((x_i, y_i, z_i)\) is the i-th box’s center, and \(\lambda \) and \(\delta \) determine the relationship between the voxel size and the distance from the box to the LiDAR sensor.
Assume we have grouped the points in the box \(b_i\) by patch search and obtained \(\mathcal {P}_i=\{p_j^i=\left[ x_j^i, y_j^i, z_j^i, r_j^i\right] \in \mathbb {R}^4: j = 1,\cdots ,N\}\), where \((x_j^i, y_j^i, z_j^i)\) and \(r_j^i\) indicate j-th point’s coordinate in the i-th box’s canonical coordinate system and the reflectance intensity. We assign each point to evenly divided voxel grids, and the grid index of the point \(p_j^i\) is represented as \(\{g_j^i=(\lfloor \frac{x_j^i}{V_i}\rfloor , \lfloor \frac{y_j^i}{V_i}\rfloor , \lfloor \frac{z_j^i}{V_i}\rfloor )\}\). Each non-empty voxel can be represented by a randomly selected point in the voxel. Next, farthest point sampling (FPS) [22] is applied to iteratively sample the most distant non-empty voxels.
A potential problem with DVFS lies in that, for the distant box, a small voxel size will divide the box into numerous voxels, of which the non-empty voxels only occupy a small part. And to utilize the parallel computation of GPUs, voxel grids of other boxes need to be padded to the largest grid number, which will increase the memory overhead, as shown in Fig. 3(b). Since we only care about non-empty voxels, we use a hash table [18] to record the hashed grid indices of non-empty voxels and quadratic probing to resolve the collisions in the hash table.
3.2 RoI-Graph Pooling
In this section, we describe the process of RoI-graph pooling, as shown in Fig. 2, which treats sampled points as nodes to build local graphs in 3D proposals. It consists of graph construction, iteration, and aggregation.
Graph Construction. Given sampled points \(\overline{\mathcal {P}}=\{p_j=\left[ x_j, y_j, z_j, r_j\right] \in \mathbb {R}^4: j = 1,\cdots ,T\}\) for each proposal b (we drop the i subscript for ease of notation), we construct a local graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E})\), where node \(v_j\in \mathcal {V}\) represents a sampled point \(p_j\in \overline{\mathcal {P}}\), and edge \(e_j^k\in \mathcal {E}\) indicates the connection between node \(v_j\) and \(v_k\). To reduce the computational overhead, we define \(\mathcal {G}\) as a k-nearest neighbor (k-NN) graph, which is built from the geometric distance among different nodes. Despite efficient, building graphs on down-sampled points inevitably loss fine-grained features. Thus, we use PointNet [21] to encode original neighbor points within a radiu r for each node. We note that neighbor query only induces a marginal computational overhead because it is only conducted for each proposal.
The same graph nodes may be wrapped by different proposals, which will result in the same pooling features and thus introduce ambiguity in the refinement stage [14, 26]. Inspired by [23], we add the 3D proposal’s local corners (i.e., the corners are transformed to the proposal’s canonical coordinate system) for each node to make them have the ability to discriminate differences. In our experiments, we found that two diagonal corners are sufficient. Formally, the initial state \(s_j^0\) of each node \(v_j\) at iteration step \(t=0\) can be represented by:
where \(\left[ \cdot ,\cdot \right] \) is concatenation function, \(f_j\) is the features from PointNet, and \(u_j\) and \(w_j\) are two diagonal corners of the 3D proposal.
Graph Iteration. To mine the rich geometric relations among nodes, we iteratively pass the message on \(\mathcal {G}\) and update the node’s state at each iteration step. Concretely, at step t, a node \(v_j\) aggregates information from all the neighbor nodes \(v_k \in \mathcal {N}_{v_j}\) in the k-NN graph. Following [2, 33, 41], we use EdgeConv [34] to update the state \(s_j^{t+1}\):
where \(\phi _\theta \) is parameterized by a Multilayer Perceptron (MLP).
Graph Aggregation. To capture robust RoI features, we propose multi-level attentive fusion (MLAF) to fuse the nodes’ features, as shown in Fig. 5. Specifically, we concatenate the nodes’ features \([s_j^1, \cdots , s_j^T]\) from different iterations and feed them into several MLPs to learn the channel-wise weights. Then, we reweight \([s_j^1, \cdots , s_j^T]\) to enhance the features for final detection. After that, it’s nontrivial to fully utilize every graph node for the proposal refinement. We explore several aggregation operations, e.g., the channel-wise Transformer [23], Set Transformer [13], attention sum, average pooling, and max pooling. Our final model uses max pooling as it provides the best empirical performance. Then, Dropout is used in later MLPs to reduce overfitting, and a shortcut connection is added to fuse more features without adding much cost.
3.3 Visual Features Augmentation
Cut-and-paste augmentation (CPA) [38] is widely used for 3D object detection to increase the training samples, which could speed up training convergence and improve the detection performance. Since our Graph R-CNN extracts RoI features directly from raw point clouds, we doubt whether CPA is required for our model. We carefully study its influence in Sect. 4.4 and find that our model does not depend on it. Thus, CPA is only used to pretrain the RPN and disabled when training the whole framework.
For the camera image, we extract high-level semantic features using a pretrained 2D detector. Then, we apply two \(1\times 1\) convolutional kernels to reduce the dimensionality of the output feature, as Fig. 2 shows. The benefit brought by it is twofold. Firstly, it can learn to select features that contribute greatly to the performance of the refinement. Secondly, it can ease the optimization to fuse low-dimensionality point features with high-dimensionality image features. Then, we project the graph node to the location in the camera image and collect the feature vector at that pixel in the camera image through bilinear interpolation. Lastly, the features will be appended to \(s_j^0\) for each node \(v_j\).
3.4 Loss Functions
Classification Loss. For class-agnostic confidence score prediction, we follow [24, 42] to use a score target \(I_i\) guided by the box’s 3D IoU with the corresponding ground-truth bounding box:
where \(\textrm{IoU}_i\) is the IoU between the i-th proposal box and the ground truth. The training is supervised with a binary cross entropy loss:
where \(\hat{I}_i\) is the predicted confidence score, and B is the number of sampled region proposals at the training stage.
Regression Loss. For box prediction, we transform the 3D proposal \(b_i = (x_i, y_i, z_i,\) \( l_i, w_i, h_i, \theta _i)\) and the corresponding 3D ground-truth bounding box \(b_i^{gt} = (x_i^{gt}, y_i^{gt},\) \( z_i^{gt}, l_i^{gt}, w_i^{gt}, h_i^{gt}, \theta _i^{gt})\) from the global reference frame to the canonical coordinate system of 3D proposal:
where \(\varDelta \theta _{i}=\theta _{i}^{gt}-\theta _{i}\). Then, the regression targets for center \(t_{i}^{c}\), size \(t_{i}^{s}\), and orientation \(t_{i}^{o}\) can be defined as:
Having all the targets \(t_i=(t_i^c,t_i^s,t_i^o)\), our regression loss is defined as:
where \(o_i\) is the output of the model’s regression branch, and \(B_{+}\) is the number of positive samples.
Total Loss. Finally, the overall loss is formulated as:
where \(\alpha \) is a hyperparameter to balance the loss, which is 1 by default.
4 Experiments
4.1 Datasets
Waymo Open Dataset is a large-scale autonomous driving dataset consisting of 798 scenes for training and 202 scenes for validation. The evaluation protocol consists of average precision (AP) and average precision weighted by heading (APH). It includes two difficulty levels: LEVEL_1 denotes objects containing more than 5 points, and LEVEL_2 denotes objects containing at least 1 point.
KITTI contains 7481 training samples and 7518 testing samples in autonomous driving scenes. We follow [4, 7, 24] to divide the training data into a train set with 3712 samples and a val set with 3769 samples. The performance on the val set and the test leaderboard are reported.
4.2 Implementation Settings
Implementation Details. The codebase of CenterPoint is used for WOD. Then, we replace the second stage of CenterPoint-Voxel with our method (i.e., Graph-Ce) and train the network separately. For the dynamic point aggregation, we sample 256 points for each proposal and set \(\sigma =0.4\). In dynamic farthest voxel sampling, we set hash size as 4099 and \(\lambda \) and \(\delta \) as 0.18 and 50, respectively. In patch search, the K and patch size are set to 32 and 1.0, respectively. For the RoI-graph pooling, we set \(r=0.4\) and the embedding channels to [16, 16] in PointNet. We update the graph with \(T=3\), and the output dimensions of the three iterations are [32, 32, 64]. The number k of nearest neighbors is set as 8. In MLAF, the embedding dimension is 256, and the dropout ratio is 0.1.
For KITTI, the codebase of OpenPCDet is used. We propose Graph-Pi, Graph-Vo, and Graph-Po that use the pillar-based PointPillars, the voxel-based SECOND, and the point-based 3DSSD as their region proposal networks, respectively. We incorporate the image branch in Graph-Vo (i.e., Graph-VoI) to compare with previous multi-modality methods. For 2D detector, we use the CenterNet [48] with DLA-34 [44] backbone, which takes images with a resolution of \(1280\times 384\) as input. The dimension of the output features will be reduced to 32 by the feature reduction layer.
Training Details. For WOD, we use the same training schedules and assignment strategies as CenterPoint-Voxel. The second stage is trained for 6 epochs on 4 GTX 1080Ti GPUs with 8 batch size per GPU.
For KITTI, we use the same training configuration as PV-RCNN and train the whole model end-to-end for 80 epochs on 4 GTX 1080Ti GPUs with 4 batch size per GPU, and the pretrained RPN and 2D detector are frozen during training. For 2D detector, we pretrain CenterNet on WOD for 24 epochs and finetune it on KITTI for 12 epochs. We use Adam optimizer with one-cycle policy and set batch size to 2 and learning rate to 0.00025.
4.3 Comparison with State-of-the-Art Methods
Waymo Open Dataset. We compare Graph-Ce for the vehicle class at different distances on the full WOD validation with previous methods. Table 1 shows that Graph-Ce achieves the state-of-the-art results in both level 1 and level 2 among all the published papers with a single frame LiDAR input. In Table 2, we present our results for the vehicle, pedestrian, and cyclist classes. Compared with our baseline, i.e., CenterPoint-Voxel\(^\ddag \), our method improves the 3D APH in level 2 for the vehicle, pedestrian, and cyclist by 5.93%, 6.35%, and 2.90%, respectively.
KITTI. We compare Graph-Pi, Graph-Vo, Graph-Po, and Graph-VoI with previous methods listed in Table 3. Graph-Pi achieves the fastest inference speed among all two-stage methods. Compared with methods using only LiDAR as input, Graph-Po ranks the 1\(^{st}\) place in 3D AP and BEV AP with competitive inference speed. Graph-VoI outperforms all previous multi-modality methods by a large margin (+2.08% for 3D easy AP and +2.6% for 3D moderate AP). Table 4 shows our method could consistently improve PointPillars, SECOND, and 3DSSD by a large margin, demonstrating the efficacy of the method.
4.4 Ablation Study
Analysis of the Dynamic Point Aggregation. The third and fourth rows in Table 5 show that the dynamic point aggregation (DPA) contributes an improvement of 1.06% and 0.87% APH at level 2 for vehicle and pedestrian, respectively. Especially, Table 6 shows that DPA improves the baseline by 1.49% APH at 0–30m since the nearby objects suffer more from the uneven distribution problem. The first and second rows in Table 8 show that the patch search (PS) is 35\(\times \) faster than the baseline, i.e., point cloud region pooling (PR) [25].
We also explore several sampling strategies in Table 8 to solve the uneven distribution problem, i.e., voxel sampling (VS), dynamic voxel sampling (DVS), dynamic farthest voxel sampling (DFVS), and farthest point sampling (FPS). We note that VS is a special case of DVS when \(\delta \) is large, and FPS is a special case of DFVS when \(\lambda \) is small. Table 8 shows that DFVS achieves the best trade-off between accuracy and efficiency. Further, we explore the use of DFVS on point-based 3D object detectors as an alternative to FPS. Table 7 shows that DFVS achieves similar results with FPS but costs less runtime.
Analysis of the RoI-Graph Pooling. The first and second rows in Table 5 show that the RoI-graph pooling (RGP) raises the APH at level 2 for vehicle and pedestrian by 3.31% and 4.12%, respectively. In Table 9, we study the effect of the number of iterations on the detection accuracy, where the number of neighbors is set to 8 by default to save GPU memory. This result suggests it is beneficial to iterate more times to mine geometric relations. Besides, we find accuracy gains for distant objects are greater than for nearby objects since the contextual information is better modeled to alleviate the missing detection of distant objects. For graph construction, we investigate the components used in the initial state of the graph node. The fourth and fifth rows in Table 5 show that using PointNet (PN) raises 0.21% APH for vehicle class since the downsampling introduces the loss of fine-grained details, and the fifth and sixth rows show that adding two diagonal corners (DC) for each node raises 1.06% APH for both vehicle and pedestrian. For graph aggregation, the second and third rows in Table 5 show that multi-level attentive fusion (MLAF) boosts the performance by 0.29% APH for vehicle class. Furthermore, we study influences of different aggregation methods in Table 10, i.e., channel-wise Transformer (CT), Set Transformer (ST), attention sum (AS), average pooling (AP), and max pooling (MP). Transformer does not achieve better results, probably because it has more parameters leading to overfitting.
Analysis of the Visual Features Augmentation. By comparing the fourth and fifth rows in Table 4, we observe that the image feature raises the detection results by 2.34%, 0.75%, and 0.8% 3D AP respectively in terms of easy, moderate, and hard. We carefully analyze the effect of the cut-and-paste augmentation (CPA) at different stages in Table 11. We find that the refinement stage is hardly affected by CPA. Thus, we can conveniently train the LiDAR branch and the image branch end-to-end without the help of CPA. We also provide the study of different image features, i.e., the RGB of the input image, the segmentation scores, and the output features of the 2D detector. We find that using the 2D features achieves the best results.
5 Conclusions
We present an accurate and efficient 3D object detector Graph R-CNN that can be applied to existing 3D detectors. Our framework can handle the unevenly distributed and sparse point clouds by utilizing the dynamic point aggregation and the semantic-decorated local graph.
References
Bewley, A., Sun, P., Mensink, T., Anguelov, D., Sminchisescu, C.: Range conditioned dilated convolutions for scale invariant 3D object detection. In: Conference on Robot Learning (2020)
Chai, Y., et al.: To the point: efficient 3D object detection in the range image with graph convolution kernels (2021)
Chen, C., Chen, Z., Zhang, J., Tao, D.: SASA: semantics-augmented set abstraction for point-based 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence (2022)
Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3D object detection network for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Chen, Y., Liu, S., Shen, X., Jia, J.: Fast point R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
Cheng, B., Sheng, L., Shi, S., Yang, M., Xu, D.: Back-tracing representative points for voting-based 3D object detection in point clouds (2021)
Deng, J., Shi, S., Li, P., Zhou, W., Zhang, Y., Li, H.: Voxel R-CNN: towards high performance voxel-based 3D object detection. In: Proceedings of the AAAI Conference on Artificial Intelligence (2021)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012)
Gori, M., Monfardini, G., Scarselli, F.: A new model for learning in graph domains. In: Proceedings of the IEEE International Joint Conference on Neural Networks (2005)
He, C., Zeng, H., Huang, J., Hua, X.S., Zhang, L.: Structure aware single-stage 3D object detection from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
Huang, T., Liu, Z., Chen, X., Bai, X.: EPNet: enhancing point features with image semantics for 3D object detection. In: Proceedings of the European Conference on Computer Vision (2020)
Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: fast encoders for object detection from point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Lee, J., Lee, Y., Kim, J., Kosiorek, A.R., Choi, S., Teh, Y.W.: Set transformer: a framework for attention-based permutation-invariant neural networks. In: Proceedings of the International Conference on Machine Learning (2019)
Li, Z., Wang, F., Wang, N.: Lidar R-CNN: an efficient and universal 3D object detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021)
Liang, M., Yang, B., Chen, Y., Hu, R., Urtasun, R.: Multi-task multi-sensor fusion for 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Liu, Z., Xu, G., Yang, H., Liu, H., Cai, D.: SparsePoint: fully end-to-end sparse 3D object detector. CoRR abs/2103.10042 (2021)
Mao, J., Niu, M., Bai, H., Liang, X., Xu, H., Xu, C.: Pyramid R-CNN: towards better performance and adaptability for 3D object detection. In: Proceedings of the IEEE International Conference on Computer Vision (2021)
Mao, J., et al.: Voxel transformer for 3D object detection. In: Proceedings of the IEEE International Conference on Computer Vision (2021)
Najibi, M., et al.: DOPS: learning to detect 3D objects and predict their 3D shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
Pang, S., Morris, D.D., Radha, H.: CloCS: camera-lidar object candidates fusion for 3D object detection. In: International Conference on Intelligent Robots and Systems (2020)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems (2017)
Sheng, H., et al.: Improving 3D object detection with channel-wise transformer. In: Proceedings of the IEEE International Conference on Computer Vision (2021)
Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: PV-RCNN: point-voxel feature set abstraction for 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Shi, S., Wang, Z., Shi, J., Wang, X., Li, H.: From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. (2020)
Shi, W., Rajkumar, R.: Point-GNN: graph neural network for 3D object detection in a point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
Sun, P., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
Vora, S., Lang, A.H., Helou, B., Beijbom, O.: PointPainting: sequential fusion for 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
Wang, C., Ma, C., Zhu, M., Yang, X.: PointAugmenting: cross-modal augmentation for 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021)
Wang, J., Lan, S., Gao, M., Davis, L.S.: InfoFocus: 3D object detection for autonomous driving with dynamic information modeling. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 405–420. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_24
Wang, Y., et al.: Pillar-based object detection for autonomous driving. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 18–34. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6_2
Wang, Y., Solomon, J.: Object DGCNN: 3D object detection using dynamic graphs. In: Advances in Neural Information Processing Systems (2021)
Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. (2019)
Wu, X., et al.: Sparse fuse dense: towards high quality 3D detection with depth completion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2022)
Xie, L., et al.: PI-RCNN: an efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)
Xu, Q., Zhou, Y., Wang, W., Qi, C.R., Anguelov, D.: SPG: unsupervised domain adaptation for 3D object detection via semantic point generation (2021)
Yan, Y., Mao, Y., Li, B.: Second: sparsely embedded convolutional detection. Sensors 18(10) (2018)
Yang, Z., Sun, Y., Liu, S., Jia, J.: 3DSSD: point-based 3D single stage object detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: STD: sparse-to-dense 3D object detector for point cloud. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
Yin, J., Shen, J., Guan, C., Zhou, D., Yang, R.: Lidar-based online 3D video object detection with graph-based message passing and spatiotemporal transformer attention. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2020)
Yin, T., Zhou, X., Krähenbühl, P.: Center-based 3D object detection and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021)
Yoo, J.H., Kim, Y., Kim, J.S., Choi, J.W.: 3D-CVF: generating joint camera and lidar features using cross-view spatial feature fusion for 3D object detection. In: Proceedings of the European Conference on Computer Vision (2020)
Yu, F., Wang, D., Shelhamer, E., Darrell, T.: Deep layer aggregation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Zhang, W., Wang, Z., Loy, C.C.: Multi-modality cut and paste for 3D object detection. CoRR abs/2012.12741 (2020)
Zheng, W., Tang, W., Chen, S., Jiang, L., Fu, C.W.: CIA-SSD: Confident IoU-aware single-stage object detector from point cloud. In: Proceedings of the AAAI Conference on Artificial Intelligence (2021)
Zheng, W., Tang, W., Jiang, L., Fu, C.W.: SE-SSD: self-ensembling single-stage object detector from point cloud. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2021)
Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. CoRR abs/1904.07850 (2019)
Zhou, Y., et al.: End-to-end multi-view fusion for 3D object detection in lidar point clouds. In: Conference on Robot Learning (2019)
Zhou, Y., Tuzel, O.: VoxelNet: end-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Acknowledgments.
This work was supported in part by The National Key Research and Development Program of China (Grant Nos: 2018AAA0101400), in part by The National Nature Science Foundation of China (Grant Nos: 62036009, U1909203, 61936006, 62133013), in part by Innovation Capability Support Program of Shaanxi (Program No. 2021TD-05).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, H. et al. (2022). Graph R-CNN: Towards Accurate 3D Object Detection with Semantic-Decorated Local Graph. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_38
Download citation
DOI: https://doi.org/10.1007/978-3-031-20074-8_38
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20073-1
Online ISBN: 978-3-031-20074-8
eBook Packages: Computer ScienceComputer Science (R0)