
1 Introduction

This paper focuses on 3D object detection, a fundamental computer vision problem at the core of most intelligent robotic perception systems, including autonomous vehicles and drones. To achieve robust and accurate scene understanding, autonomous vehicles are usually equipped with several sensors with different functions (e.g., camera, radar, LiDAR), and the sensing modalities can be fused to exploit their complementary properties. However, developing a reliable and accurate multi-sensor perception system for autonomous driving remains a very challenging task.

Recently, 2D object detection powered by deep learning has drawn much attention, and LiDAR-based 3D object detection has also become popular. Point clouds generated by LiDAR capture surrounding objects and return accurate depth and reflection-intensity information from which the objects can be reconstructed. Because point clouds are sparse and unordered, representative works either convert raw point clouds into bird's-eye-view (BEV) pseudo images [1,2,3,4], 2D front-view images [2], or structured voxel-grid representations [5,6,7]. Other works [8,9,10] process raw point clouds directly with multi-layer perceptrons (MLPs) to estimate 3D objects and their localization. However, due to the sparsity of point clouds, these LiDAR-based approaches suffer severe information loss in long-range regions and on small objects.

On the other hand, 2D RGB images provide dense texture descriptions and, thanks to their high resolution, enough information for small objects, but precise 3D localization is still hard to obtain because perspective projection discards depth, particularly with a monocular camera [11,12,13]. Even with stereo images [14], the accuracy of the estimated depth cannot be guaranteed, especially in poor weather, darkness, and unseen scenes. Therefore, some approaches [15,16,17,18,19] have attempted to exploit the mutual advantages of point clouds and 2D images. However, they rely on early, late, or middle fusion, as shown in Fig. 1, which only shallowly fuses the two kinds of features from 2D images and point clouds, making the results inaccurate and unstable.

Fig. 1.

A comparison of existing fusion methods and the proposed deep dense fusion. Compared with methods (a–e), deep dense fusion moves forward into the feature-extraction phase and becomes denser, fully integrating the characteristics of both modalities.

MV3D [2] and AVOD [6] fuse region-based multi-modal features at the region proposal network (RPN) and detection stages; this local fusion causes a loss of semantics and makes their results inaccurate. Conversely, ContFusion [20] proposed a global fusion method that fuses BEV features and image features at different feature levels, which verifies the benefit of fully fusing 2D images and point clouds. However, ContFusion [20] fuses in only one direction. Intuitively, bidirectional fusion should be superior to unidirectional fusion. The challenge lies in the fact that image features are dense on a discrete grid, while LiDAR points are continuous but sparse, so fusing them in both directions is non-trivial.

This paper proposes a two-stage multi-sensor 3D detector, called DDF3D, which fuses image features and BEV features at different levels of resolution. DDF3D is an end-to-end learnable architecture consisting of a 3D region proposal subnetwork (RPN) and a refined detector subnetwork, as illustrated in Fig. 2. First, the raw point clouds are converted into six-channel pseudo images and the 2D images are center-cropped. Second, two identical fully convolutional networks extract view-specific features and fuse them simultaneously through the multi-view resize layer (MVRL). Third, 3D anchors are generated in BEV, and anchor-dependent features from the different views are fused to produce 3D non-oriented region proposals. Finally, the proposal-dependent features are fused again and fed to the refined subnetwork to regress dimensions, estimate orientation, and classify the category. A minimal sketch of this pipeline is given below.
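The following numpy sketch only illustrates this four-stage data flow; it is not the authors' implementation, and every helper function, tensor shape, and name below is a placeholder stub.

```python
import numpy as np

# Schematic data flow only; every helper below is a placeholder stub.
def lidar_to_bev(points):                       # stage 1: point cloud -> 6-channel pseudo image
    return np.zeros((700, 800, 6), dtype=np.float32)

def extract_and_fuse(bev, image):               # stage 2: two identical CNNs + MVRL fusion
    return np.zeros((700, 800, 32)), np.zeros((360, 1200, 32))

def rpn(bev_feat, img_feat):                    # stage 3: fused anchor features -> non-oriented proposals
    return np.zeros((1024, 6))                  # (cx, cy, cz, dx, dy, dz)

def refine(proposals, bev_feat, img_feat):      # stage 4: fused proposal features -> box, angle, class
    n = len(proposals)
    return np.zeros((n, 10)), np.zeros(n), np.zeros(n)

points = np.random.rand(20000, 4).astype(np.float32)        # x, y, z, intensity
image = np.random.rand(360, 1200, 3).astype(np.float32)     # center-cropped RGB image

bev_feat, img_feat = extract_and_fuse(lidar_to_bev(points), image)
proposals = rpn(bev_feat, img_feat)
boxes, angles, scores = refine(proposals, bev_feat, img_feat)
print(boxes.shape, angles.shape, scores.shape)
```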

Fig. 2.

The architecture of deep dense fusion 3D object detection network.

The contributions of this paper are summarized in the following three points:

  1. A highly efficient multi-view resize layer (MVRL) is designed to resize features from the BEV and camera views simultaneously, which makes real-time fusion of multiple view-specific feature maps possible.

  2. Based on the MVRL, a deep dense fusion method is proposed to fully fuse view-specific feature maps at different levels of resolution synchronously. The fusion method allows the feature maps to be fully fused throughout feature extraction, which greatly improves detection accuracy for small objects.

  3. The proposed architecture achieves higher and more robust 3D detection and localization accuracy for the car, cyclist, and pedestrian classes, and in particular greatly improves the accuracy for small classes in both 2D and 3D.

2 The Proposed Architecture

The main innovation of the proposed DDF3D, depicted in Fig. 2, is to fully fuse view-specific features simultaneously through the MVRL; the fused features are fed into the next convolutional layers of the BEV stream and the camera stream, respectively, as detailed in Fig. 1f. After the feature extractor, the two feature maps are fused again, and the 3D RPN generates 3D non-oriented region proposals, which are fed to the refined detection subnetwork for dimension refinement, orientation estimation, and category classification.

Bird's Eye View Representation.

Like MV3D [2] and AVOD [15], a six-channel BEV map is generated by encoding the height and density of the points in each voxel of each LiDAR frame; in particular, the height is the absolute height relative to the ground. First, the raw point clouds are cropped to \( \left[ { - 40,40} \right] \times \left[ {0,70} \right] \) m and limited to the camera's field of view. Along the X and Y axes, the point clouds are voxelized at a resolution of 0.1 m. Then, the voxelized point clouds are equally divided into 5 slices between [−2.3, 0.2] m along the Z axis. Finally, the point density in each cell is computed as \( \min \left( {1.0, \frac{{\log \left( {N + 1} \right)}}{\log 64}} \right) \), where N is the number of points in the pillar. Note that the density feature is computed over the whole point cloud while the height feature is computed for each of the 5 slices, so a 700 × 800 × 6 BEV feature map is generated for each LiDAR frame. In addition to the feature map, the voxelized points of each LiDAR frame are also kept to construct the MVRL.
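To make the encoding concrete, here is a minimal numpy sketch of how such a map could be built. It is an interpretation of the description above, not the paper's code; measuring heights from the lower z bound so that empty cells stay at zero is an assumption made here.

```python
import numpy as np

def make_bev_map(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                 z_range=(-2.3, 0.2), res=0.1, n_slices=5):
    """Encode a point cloud as a 700 x 800 x 6 BEV pseudo image:
    5 height channels (one per slice) plus 1 density channel."""
    H = int((x_range[1] - x_range[0]) / res)                 # 700 rows along X
    W = int((y_range[1] - y_range[0]) / res)                 # 800 cols along Y
    bev = np.zeros((H, W, n_slices + 1), dtype=np.float32)

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z = x[keep], y[keep], z[keep]

    row = ((x - x_range[0]) / res).astype(np.int64)
    col = ((y - y_range[0]) / res).astype(np.int64)
    slice_h = (z_range[1] - z_range[0]) / n_slices
    sl = np.minimum(((z - z_range[0]) / slice_h).astype(np.int64), n_slices - 1)

    # height feature: per-slice maximum height, measured from the lower z bound
    np.maximum.at(bev[..., :n_slices], (row, col, sl), z - z_range[0])

    # density feature: min(1, log(N + 1) / log 64) over the whole pillar
    counts = np.zeros((H, W), dtype=np.float32)
    np.add.at(counts, (row, col), 1.0)
    bev[..., n_slices] = np.minimum(1.0, np.log(counts + 1.0) / np.log(64.0))
    return bev

pts = np.column_stack([np.random.uniform(0, 70, 50000),
                       np.random.uniform(-40, 40, 50000),
                       np.random.uniform(-2.3, 0.2, 50000)]).astype(np.float32)
print(make_bev_map(pts).shape)    # (700, 800, 6)
```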

2.1 The Feature Extractor and Multi-view Resize Layer

This section introduces the feature extractor and the MVRL. The MVRL resizes the view-specific features at different resolutions; the view-specific features are then concatenated with the features resized from the other view, and the fused features are fed into the next convolutional layers.

The Multi-view Resize Layer.

Fusing feature maps from different perspectives is not easy, since the feature maps from the different views have different sizes; fusion efficiency is a further challenge. The multi-view resize layer is therefore designed to bridge multiple intermediate layers on both sides and resize multi-sensor features at multiple scales with high efficiency. The inputs of the MVRL consist of three parts: the source BEV indices \( I_{bev/ori} \) and the LiDAR points \( P_{ori} \) obtained during density-feature generation, the camera feature \( f_{cam} \), and the BEV feature \( f_{bev} \). The workflow of the MVRL is shown in Fig. 3; it consists of data preparation and bidirectional fusion. In data preparation, the voxelized points \( P_{ori} \) are projected onto the camera plane, as formulated in Eq. 1 and Eq. 2, and only the points \( P_{cam} \) falling inside the original image of size \( 360 \times 1200 \) are kept. The points \( P_{cam/fusion} \) scaled to the current image feature size \( H_{i} \times W_{i} \) are used to obtain image indices \( I_{cam/fusion} \) via bilinear interpolation. New BEV indices \( I_{bev/fusion} \) are obtained from the source BEV indices \( I_{bev/ori} \) and the current BEV feature size \( H_{b} \times W_{b} \). Then, a sparse tensor \( T_{s} \) of shape \( H_{b} \times W_{b} \) is generated from the image indices \( I_{cam/fusion} \) and the BEV indices \( I_{bev/fusion} \). Finally, a feature map is multiplied by the sparse tensor to generate the feature to be fused with the camera feature map or the BEV feature map, as formulated in Eq. 3 and Eq. 4.

Fig. 3.

Multi-view resize layer: it includes data preparation and bidirectional fusion.

$$ (u, v)^{T} = M \cdot (x, y, z)^{T}, $$
(1)
$$ M = P_{rect} \cdot \begin{pmatrix} R_{velo}^{cam} & t_{velo}^{cam} \\ 0 & 1 \end{pmatrix}, $$
(2)
$$ f_{b2c} = S\left( \mathrm{Matmul}\left( T_{s}^{-1}, f_{bev} \right) \right), $$
(3)
$$ f_{c2b} = R\left( \mathrm{Matmul}\left( T_{s}, G\left( f_{cam}, I_{cam/fusion} \right) \right) \right), $$
(4)

where (x, y, z) is a LiDAR point coordinate and (u, v) is the corresponding image coordinate, \( P_{rect} \) is the projection matrix, \( R_{velo}^{cam} \) is the rotation matrix from LiDAR to the camera, \( t_{velo}^{cam} \) is the translation vector, M is the homogeneous transformation matrix from LiDAR to the camera, S and G denote the scatter and gather operations, respectively, Matmul denotes matrix multiplication, R denotes the reshape operation, \( f_{b2c} \) is the feature transferred from BEV to the camera view, and conversely \( f_{c2b} \) is the feature transferred from the camera view to BEV.
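To make Eqs. 1–4 concrete, below is a simplified numpy sketch of the projection and the bidirectional feature transfer. This is not the authors' TensorFlow code: nearest-neighbour indexing replaces bilinear interpolation, direct index assignment replaces the sparse tensor \( T_{s} \), and the calibration values and grid sizes are toy placeholders.

```python
import numpy as np

def project_to_image(pts_lidar, P_rect, R_velo_cam, t_velo_cam):
    """Eqs. 1-2 in homogeneous form: LiDAR (x, y, z) -> image (u, v)."""
    Rt = np.vstack([np.hstack([R_velo_cam, t_velo_cam.reshape(3, 1)]),
                    [0.0, 0.0, 0.0, 1.0]])                   # 4 x 4
    M = P_rect @ Rt                                          # 3 x 4
    pts_h = np.hstack([pts_lidar, np.ones((len(pts_lidar), 1))])
    uvw = (M @ pts_h.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def mvrl_fuse(f_bev, f_cam, bev_rc, cam_uv):
    """Simplified bidirectional resize (Eqs. 3-4): gather camera features at the
    projected pixels and scatter them onto the occupied BEV cells (f_c2b), and
    scatter the BEV features of those cells back onto the pixels (f_b2c)."""
    Hi, Wi, _ = f_cam.shape
    r, c = bev_rc[:, 0], bev_rc[:, 1]
    u = np.clip(cam_uv[:, 0].astype(int), 0, Wi - 1)         # nearest-neighbour lookup
    v = np.clip(cam_uv[:, 1].astype(int), 0, Hi - 1)         # instead of bilinear, for brevity

    f_c2b = np.zeros_like(f_bev)
    f_c2b[r, c] = f_cam[v, u]                                # camera -> BEV
    f_b2c = np.zeros_like(f_cam)
    f_b2c[v, u] = f_bev[r, c]                                # BEV -> camera
    return f_b2c, f_c2b

# toy calibration and grid (placeholder values, not KITTI's)
pts = np.random.uniform([1, -10, -2], [40, 10, 0], size=(1000, 3))
P_rect = np.array([[700.0, 0, 600, 0], [0, 700.0, 180, 0], [0, 0, 1, 0]])
R = np.array([[0., -1, 0], [0, 0, -1], [1, 0, 0]])           # LiDAR axes -> camera axes
uv = project_to_image(pts, P_rect, R, np.zeros(3))
bev_rc = np.stack([np.clip((pts[:, 0] / 0.1).astype(int), 0, 399),
                   np.clip(((pts[:, 1] + 10.0) / 0.1).astype(int), 0, 199)], axis=1)
f_b2c, f_c2b = mvrl_fuse(np.random.rand(400, 200, 8), np.random.rand(360, 1200, 8), bev_rc, uv)
print(f_b2c.shape, f_c2b.shape)
```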

The Feature Extractor.

The backbone network follows a two-stream architecture [22] to process the multi-sensor data. Specifically, two identical CNNs extract features from the 2D image and the BEV representation. Each CNN consists of two parts: an encoder and a decoder. VGG-16 [23] is modified and simplified as the encoder: the convolutional layers from conv-1 to conv-4 are kept, and the number of channels is halved. During feature extraction, the MVRL resizes the features on both sides. Because only a little information is retained for small classes such as cyclists and pedestrians in the encoder output, a decoder inspired by FCNs [24] and the Feature Pyramid Network (FPN) [25] up-samples the features back to the original input size. To fully fuse the view-specific features, the MVRL is applied again after decoding. The final feature maps have strong semantics at high resolution and are fed into the 3D RPN and the refined subnetwork.
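As a rough Keras sketch of one such stream (an assumption-laden illustration, not the released network: the exact channel counts, the padding of the BEV map to 704 rows, and the skip-connection style are choices made here for simplicity, and the MVRL fusion between blocks is omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers

def vgg_stream(inputs, name):
    """One stream of the two-stream backbone: a simplified VGG-16 encoder
    (conv-1..conv-4, channels halved: 32/64/128/256) and an FPN-style decoder
    that upsamples back to the input resolution. MVRL fusion with the other
    stream would be inserted between the blocks (omitted here)."""
    x = inputs
    skips = []
    for i, (filters, n_convs) in enumerate([(32, 2), (64, 2), (128, 3), (256, 3)]):
        for j in range(n_convs):
            x = layers.Conv2D(filters, 3, padding='same', activation='relu',
                              name=f'{name}_conv{i+1}_{j+1}')(x)
        skips.append(x)
        if i < 3:
            x = layers.MaxPool2D(2, name=f'{name}_pool{i+1}')(x)
    # decoder: upsample and merge with the encoder skip connections
    for i, skip in enumerate(reversed(skips[:-1])):
        x = layers.Conv2DTranspose(skip.shape[-1], 3, strides=2, padding='same',
                                   name=f'{name}_deconv{i+1}')(x)
        x = layers.Concatenate(name=f'{name}_skip{i+1}')([x, skip])
        x = layers.Conv2D(skip.shape[-1], 3, padding='same', activation='relu',
                          name=f'{name}_merge{i+1}')(x)
    return x

bev_in = tf.keras.Input(shape=(704, 800, 6), name='bev')     # padded to a multiple of 8
img_in = tf.keras.Input(shape=(360, 1200, 3), name='image')
model = tf.keras.Model([bev_in, img_in],
                       [vgg_stream(bev_in, 'bev'), vgg_stream(img_in, 'img')])
model.summary()
```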

2.1.1 3D Region Proposal Network

3D Anchor Generation and Fusion.

Unlike MV3D [2], this paper directly generates 3D plane-based anchors, like AVOD [15] and MMF [22]. The 3D anchors are parameterized by the centroid \( \left( {c_{x} ,c_{y} ,c_{z} } \right) \) and axis-aligned dimensions \( \left( {d_{x} ,d_{y} ,d_{z} } \right) \). The \( \left( {c_{x} ,c_{y} } \right) \) pairs are sampled at intervals of 0.5 m in BEV, while \( c_{z} \) is a fixed value determined by the height of the sensor relative to the ground plane. Since the orientation is not regressed at the 3D proposal stage, \( \left( {d_{x} ,d_{y} ,d_{z} } \right) \) are obtained from the (w, l, h) of the prior 3D boxes by rotation, and (w, l, h) are determined by clustering the training samples of each class. For the car class, each location has anchors of two sizes, while each location has a single anchor size for pedestrians and cyclists.
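A small numpy sketch of the plane-based anchor tiling described above (the 0.5 m grid stride and the fixed \( c_{z} \) follow the text, but the size priors and the \( c_{z} \) value used here are illustrative rather than the clustered ones):

```python
import numpy as np

def generate_anchors(x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                     stride=0.5, cz=-1.0,
                     sizes=((3.9, 1.6, 1.56), (4.5, 1.7, 1.65))):
    """Tile (cx, cy) on a 0.5 m BEV grid; cz is fixed from the sensor height.
    Each location gets one anchor per (l, w, h) prior (two for cars)."""
    cx = np.arange(x_range[0], x_range[1], stride)
    cy = np.arange(y_range[0], y_range[1], stride)
    gx, gy = np.meshgrid(cx, cy, indexing='ij')
    centers = np.stack([gx.ravel(), gy.ravel(), np.full(gx.size, cz)], axis=1)

    anchors = []
    for l, w, h in sizes:                        # non-oriented: dims taken as-is
        dims = np.tile([l, w, h], (len(centers), 1))
        anchors.append(np.hstack([centers, dims]))
    return np.concatenate(anchors, axis=0)       # (N, 6): cx, cy, cz, dx, dy, dz

anchors = generate_anchors()
print(anchors.shape)    # (44800, 6): 140 x 160 locations x 2 car priors
```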

3D Proposal Generation.

AVOD [15] reduces the number of channels of the BEV and image feature maps to 1 in order to process anchors with a small memory overhead, and uses the truncated features to generate region proposals. However, this rough treatment loses most of the key features and causes proposal instability. To keep proposals stable with a small memory overhead, a 1 × 1 convolution is applied to the view-specific features output by the decoder, with the same number of output channels as input channels. Each 3D anchor is projected onto the BEV and image feature maps output by the 1 × 1 convolutional layers to obtain two corresponding region-based features. These features are cropped and resized to equal-length vectors, e.g. \( 3 \times 3 \), and the fixed-length feature crop pairs from the two views are fused by concatenation, as sketched below.
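A hedged TensorFlow sketch of this crop–resize–concatenate step (the 3 × 3 crop size follows the text; using `tf.image.crop_and_resize` with normalized boxes is one possible way to realize it, and all shapes below are placeholders):

```python
import tensorflow as tf

def fuse_anchor_features(bev_feat, img_feat, bev_rois, img_rois, crop=(3, 3)):
    """bev_feat, img_feat: [1, H, W, C] maps after the 1x1 conv (same channel
    count as the decoder output). bev_rois / img_rois: [N, 4] normalized
    [y1, x1, y2, x2] boxes, i.e. each 3D anchor projected onto each view."""
    n = tf.shape(bev_rois)[0]
    idx = tf.zeros([n], dtype=tf.int32)                        # single-image batch
    bev_crops = tf.image.crop_and_resize(bev_feat, bev_rois, idx, crop)
    img_crops = tf.image.crop_and_resize(img_feat, img_rois, idx, crop)
    return tf.concat([bev_crops, img_crops], axis=-1)          # [N, 3, 3, 2C]

# toy usage: a 1x1 conv keeping the channel count, then 4 dummy anchors
bev = tf.keras.layers.Conv2D(32, 1)(tf.random.uniform([1, 88, 100, 32]))
img = tf.keras.layers.Conv2D(32, 1)(tf.random.uniform([1, 45, 150, 32]))
rois = tf.random.uniform([4, 2])
rois = tf.concat([rois, rois + 0.1], axis=-1)                  # y1, x1, y2, x2
print(fuse_anchor_features(bev, img, rois, rois).shape)        # (4, 3, 3, 64)
```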

Two similar branches [15] of 256-dimensional fully connected layers take the fused feature crops as input to regress 3D proposal boxes and to perform binary classification. The regression branch regresses \( \left( {\Delta c_{x} ,\Delta c_{y} ,\Delta c_{z} ,\Delta d_{x} ,\Delta d_{y} ,\Delta d_{z} } \right) \) between the anchors and the target proposals. The classification branch scores whether an anchor captures an object or background. All 3D anchors are divided into negative, positive, and ignored by projecting the 3D anchors and the corresponding ground truth onto BEV and computing the 2D IoU between the anchors and the ground-truth bounding boxes. For the car class, anchors with IoU less than 0.3 are considered negative and anchors with IoU greater than 0.5 are considered positive; the others are ignored. For the pedestrian and cyclist classes, the positive-anchor IoU threshold is reduced to 0.45. Ignored anchors do not contribute to the training objective [21].
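The anchor labeling rule can be written down directly; the numpy sketch below uses axis-aligned BEV IoU only and omits the rotated ground-truth case:

```python
import numpy as np

def bev_iou(anchors, gts):
    """Axis-aligned 2D IoU in BEV. Boxes are [x1, y1, x2, y2]; returns [A, G]."""
    x1 = np.maximum(anchors[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gts, pos_thr=0.5, neg_thr=0.3):
    """Car thresholds from the text: IoU > 0.5 positive, < 0.3 negative,
    in between ignored (label -1). Pedestrian/cyclist would use pos_thr=0.45."""
    max_iou = bev_iou(anchors, gts).max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)      # ignore by default
    labels[max_iou < neg_thr] = 0
    labels[max_iou > pos_thr] = 1
    return labels

anchors = np.array([[0., 0., 4., 2.], [10., 10., 14., 12.], [1.5, 0., 5.5, 2.]])
gts = np.array([[0., 0., 4., 2.]])
print(label_anchors(anchors, gts))    # [ 1  0 -1]
```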

The loss function of the 3D proposal stage is defined as follows:

$$ Loss = \lambda L_{cls} + \gamma L_{box} , $$
(5)

where \( L_{cls} \) is the focal loss for object classification and \( L_{box} \) is the smooth L1 loss for 3D proposal box regression; \( \lambda = 1.0 \) and \( \gamma = 5.0 \) are the weights that balance the two tasks.
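For reference, a minimal TensorFlow sketch of Eq. 5 with a standard smooth L1 regression term (the exact reduction over anchors is not specified in the text and is an assumption here):

```python
import tensorflow as tf

def smooth_l1(delta_pred, delta_target):
    """Standard smooth L1 (Huber with delta = 1) summed over box parameters."""
    diff = tf.abs(delta_pred - delta_target)
    per_elem = tf.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return tf.reduce_sum(per_elem, axis=-1)

def rpn_loss(cls_loss, box_loss, lam=1.0, gamma=5.0):
    """Eq. 5: weighted sum of the classification (focal) and regression terms."""
    return lam * cls_loss + gamma * box_loss

pred, target = tf.random.uniform([8, 6]), tf.random.uniform([8, 6])
print(rpn_loss(tf.constant(0.3), tf.reduce_mean(smooth_l1(pred, target))))
```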

After the two task-specific branches, 2D non-maximum suppression (NMS) at an IoU threshold of 0.8 in BEV is used to remove redundant 3D proposals, and the top 1,024 proposals are kept during training. At inference time, 300 proposals are kept for the car class and 1,024 proposals for the cyclist and pedestrian classes.
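This proposal filtering can be expressed with TensorFlow's built-in NMS (the box format and scores below are placeholders):

```python
import tensorflow as tf

def select_proposals(bev_boxes, scores, training=True, car=True):
    """2D NMS in BEV at IoU 0.8, then keep the top proposals:
    1,024 during training; 300 (car) or 1,024 (ped/cyclist) at inference."""
    max_out = 1024 if (training or not car) else 300
    keep = tf.image.non_max_suppression(bev_boxes, scores,
                                        max_output_size=max_out,
                                        iou_threshold=0.8)
    return tf.gather(bev_boxes, keep), tf.gather(scores, keep)

boxes = tf.random.uniform([2000, 2])
boxes = tf.concat([boxes, boxes + 0.05], axis=-1)            # [y1, x1, y2, x2]
kept_boxes, kept_scores = select_proposals(boxes, tf.random.uniform([2000]), training=False)
print(kept_boxes.shape)
```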

2.2 The Refined Network

The refined network further optimizes the detections, taking the top K non-oriented region proposals and the features output by the two identical CNNs to improve the final 3D object detection performance. First, the top K non-oriented region proposals are projected onto the BEV and image feature maps output by the feature extractors to obtain two corresponding region-based features. The region-based features are cropped and resized to an equal-length 7 × 7 shape. Then, the paired fixed-length crops from the two views are fused with an element-wise mean. The fused features are fed into three parallel fully connected layers that output bounding-box regression, orientation estimation, and category classification simultaneously. MV3D [2] proposes an 8-corner encoding; however, it does not take into account the physical constraints of a 3D bounding box. Like AVOD [15], a plane-based 3D bounding box is represented by a 10-dimensional vector to remove redundancy and keep the physical constraints. Ground-truth boxes and 3D anchors are defined by \( \left( {x_{1} \cdots x_{4} ,y_{1} \cdots y_{4} ,h_{1} ,h_{2} ,\theta } \right) \). The corresponding regression residuals between 3D anchors and ground truth are defined as follows:

$$ \Delta x = \frac{x_{c}^{g} - x_{c}^{a}}{d^{a}}, \quad \Delta y = \frac{y_{c}^{g} - y_{c}^{a}}{d^{a}}, \quad \Delta h = \log\left( \frac{h^{g}}{h^{a}} \right), $$
(6)
$$ \Delta \theta = \sin\left( \theta^{g} - \theta^{a} \right), $$
(7)

where \( d^{a} = \sqrt {\left( {x_{1} - x_{2} } \right)^{2} + \left( {y_{4} - y_{1} } \right)^{2} } \) is the diagonal of the base of the anchor box.
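A numpy sketch of the residual encoding of Eqs. 6–7 for one anchor/ground-truth pair (the corner ordering and the grouping into a dictionary are assumptions made for readability):

```python
import numpy as np

def encode_residuals(gt, anchor):
    """gt / anchor: dicts with 4 base corners 'x' (4,), 'y' (4,), two heights
    'h' (2,), and orientation 'theta'. Implements Eqs. 6-7."""
    d_a = np.sqrt((anchor['x'][0] - anchor['x'][1]) ** 2 +
                  (anchor['y'][3] - anchor['y'][0]) ** 2)      # base diagonal
    dx = (gt['x'] - anchor['x']) / d_a
    dy = (gt['y'] - anchor['y']) / d_a
    dh = np.log(gt['h'] / anchor['h'])
    dtheta = np.sin(gt['theta'] - anchor['theta'])
    return np.concatenate([dx, dy, dh, [dtheta]])              # 11 regression targets

anchor = dict(x=np.array([0., 4., 4., 0.]), y=np.array([0., 0., 2., 2.]),
              h=np.array([0.1, 1.6]), theta=0.0)
gt = dict(x=anchor['x'] + 0.3, y=anchor['y'] - 0.2,
          h=np.array([0.15, 1.7]), theta=0.1)
print(encode_residuals(gt, anchor))
```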

The localization and orientation loss functions [7] are defined as follows:

$$ L_{box} = \sum\nolimits_{b \in \left( x_{1} \cdots x_{4}, y_{1} \cdots y_{4}, h_{1}, h_{2}, \theta \right)} \mathrm{Smooth}_{L1}\left( \Delta b \right), $$
(8)
$$ L_{dir} = \sum \mathrm{Smooth}_{L1}\left( \Delta \theta \right). $$
(9)

For the object classification loss, the focal loss is used:

$$ L_{cls} = -\alpha_{a} \left( 1 - p^{a} \right)^{\gamma} \log\left( p^{a} \right), $$
(10)

where \( p^{a} \) is the class probability of an anchor; we set \( \alpha = 0.25 \) and \( \gamma = 2 \). The total loss for the refined network is therefore

$$ Loss = \frac{1}{N_{pos}} \left( \beta_{1} L_{box} + \beta_{2} L_{cls} + \beta_{3} L_{dir} \right), $$
(11)

where \( N_{pos} \) is the number of positive anchors and \( \beta_{1} = 5.0 \), \( \beta_{2} = 1.0 \), \( \beta_{3} = 1.0 \).
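Putting Eqs. 10–11 together, a minimal TensorFlow sketch (assuming the per-branch losses \( L_{box} \) and \( L_{dir} \) have already been reduced to scalars):

```python
import tensorflow as tf

def focal_loss(p_a, alpha=0.25, gamma=2.0):
    """Eq. 10: focal loss given the predicted probability of the target class."""
    return -alpha * tf.pow(1.0 - p_a, gamma) * tf.math.log(p_a)

def refined_loss(l_box, l_cls, l_dir, n_pos, b1=5.0, b2=1.0, b3=1.0):
    """Eq. 11: weighted sum normalized by the number of positive anchors."""
    return (b1 * l_box + b2 * l_cls + b3 * l_dir) / tf.cast(n_pos, tf.float32)

p = tf.constant([0.9, 0.6, 0.2])      # probability of the correct class per anchor
print(refined_loss(tf.constant(1.3), tf.reduce_sum(focal_loss(p)),
                   tf.constant(0.2), n_pos=2))
```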

In the refined network, 3D proposals contribute to the regression loss only if their 2D IoU in bird's eye view with a ground-truth box is at least 0.65 for the car class (0.55 for the pedestrian and cyclist classes). NMS at a threshold of 0.01 is used to select the best detections.

3 Experiments and Results

3.1 Implementation Details

Because the 2D RGB camera images have different sizes, they are center-cropped to a uniform size of 1200 × 360. Each point cloud is voxelized into a 700 × 800 × 6 BEV pseudo image. For data augmentation, the images, voxelized pseudo images, and ground-truth labels are flipped horizontally together with a probability of 0.5 during training. The DDF3D model is implemented in TensorFlow on one NVIDIA 1080 Ti GPU with a batch size of 1, using the Adam optimizer. The model is trained for a total of 120K iterations with an initial learning rate of 0.0001, decayed by a factor of 0.1 at 60K and 90K iterations. The whole training process takes only 14 h, and the model is evaluated every 5K iterations from 80K to 120K iterations.
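The optimizer and learning-rate schedule described above can be sketched as follows (the decay boundaries and rates come from the text; the schedule class used here is one standard way to realize them, not necessarily the authors' choice):

```python
import tensorflow as tf

# learning rate 1e-4, decayed by 0.1 at 60K and 90K of 120K total iterations
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[60_000, 90_000],
    values=[1e-4, 1e-5, 1e-6])
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

for step in (0, 60_000, 90_000, 119_999):
    print(step, float(schedule(step)))
```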

3.2 Quantitative Results

To showcase the superiority of the deep dense fusion method, this paper compares the proposed approach with existing state-of-the-art fusion methods (MV3D [2], AVOD [15], F-PointNet [17], ContFusion [20], MCF3D [16]) that use only RGB images and point clouds as inputs. Table 1 compares the 3D and BEV performance measured by AP. According to KITTI's metric, DDF3D improves the "Moderate" difficulty of the car class by 0.41% in 3D performance and 2.54% in BEV performance. For the pedestrian and cyclist classes, DDF3D achieves 2.00% growth in BEV performance on the "Moderate" difficulty for pedestrians and 1.04% growth in 3D performance on the "Moderate" difficulty for cyclists. In the "Easy" difficulty of 3D performance, DDF3D surpasses the second-best method by 1.50% for pedestrians and 1.01% for cyclists. However, F-PointNet [17] is slightly better than DDF3D in BEV performance for cyclists; F-PointNet [17] fine-tunes its 2D detector from ImageNet weights, whereas the DDF3D model is trained from scratch. Some 2D detection results on RGB images and 3D detection results are illustrated in Fig. 4.

Table 1. Comparison of the 3D Object and BEV performance of DDF3D with state-of-the-art 3D object detectors.
Fig. 4.

Visualizations of DDF3D results on RGB images and point clouds.

3.3 Ablation Study

To analyze the effects of the deep dense fusion, an ablation study is conducted on KITTI's validation subset with extensive experiments. Table 2 shows the effect of different combinations of the deep dense fusion method on the performance measured by AP. As shown in Fig. 2, each encoder has 4 convolution blocks, in order Conv1, Conv2, Conv3, Conv4, and each decoder has 4 deconvolution blocks, in order Deconv1, Deconv2, Deconv3, Deconv4. To keep DDF3D efficient, only the fusion combinations shown in Table 2 are considered.

Table 2. Ablation study on the combinations of the deep dense fusion method on KITTI's validation subset. All results are for the moderate difficulty of the car class.

To explore the effect of the fusion direction, two more sets of experiments are conducted based on the best combination in Table 2. The first experiment only projects features from BEV to the camera view; in contrast, the second only projects features from the camera view to BEV. Table 3 demonstrates that two-way fusion is better than one-way fusion. The fusion direction has a very limited effect on 2D and BEV performance, but a significant impact on 3D detection accuracy.

Table 3. Ablation study on the fusion direction. B2C means one-way fusion from BEV to the camera view, C2B means one-way fusion from the camera view to BEV, and Both means bidirectional fusion.

Besides, the DDF3D model converges quickly and the evaluation metrics remain steady after 80K iterations, so a model can be judged good or not within 12 h. Figure 5 shows the evaluation results extracted every 5K iterations from 80K to 120K iterations on the validation subset.

Fig. 5.

3D detection accuracy of DDF3D for the car class from 80K to 120K iterations. The light coral, medium aquamarine, and Navajo white colors denote the Easy, Moderate, and Hard difficulties, respectively. (Color figure online)

4 Conclusion

This work proposed DDF3D, a fully fused 3D detection architecture that takes full advantage of the mutual benefits of RGB images and point clouds in the feature-extraction phase. The deep dense fusion is bidirectional and simultaneous. With a high-resolution feature extractor and the fully fused features, the proposed architecture greatly improves 3D detection accuracy, especially for small objects. In extensive experiments on the KITTI object detection dataset, DDF3D outperforms existing state-of-the-art methods in 2D, 3D, and BEV detection.