1 Introduction

An intelligent driving vehicle is a complex system that combines perception, decision-making, and control technologies. Environmental perception provides the fundamental information for path planning, decision-making, and control. Vehicle detection is an extremely important task in the environmental perception system of an autonomous vehicle. At present, LiDAR and cameras are the mainstream obstacle detection sensors. Cameras are widely used in intelligent driving, especially for traffic sign identification and lane recognition, thanks to their low cost and ability to capture the texture and color of objects.

During the past few years, 2D object detection from camera images has seen significant progress [1,2,3]. However, there is still large room for improvement when it comes to object localization in 3D space. Because a camera is sensitive to light and shadows, it cannot provide accurate and sufficient positional information, which often results in poor real-time performance and robustness. In contrast, LiDAR can obtain the distance and 3D information of a detected object, and it has been widely used in environmental perception. In general, there are two methods for spatially processing the LiDAR point cloud before the 3D information is used. The first method is to build a 3D grid map from the LiDAR point cloud [4, 5] and then process the point cloud on that grid. Although the 3D grid representation preserves most of the raw information of the point cloud, it usually requires much more complex computation in subsequent processing. The second method is to project the LiDAR point cloud into 2D space [6], which reduces the computational cost. Although LiDAR-based algorithms are widely used in target detection, the resolution of the LiDAR point cloud decreases as the detection distance increases.

As a single sensor cannot meet the needs of autonomous driving, multi-sensor fusion schemes that combine cameras and LiDAR are increasingly being applied in intelligent vehicles. Multi-sensor data fusion makes full use of the data collected by multiple sensors. According to the level of information processing, fusion systems are divided into three levels: (1) Pixel-level fusion [7, 8] integrates the collected data directly and then extracts feature vectors from the fused data to identify the detected objects. Because the fused data have not been preprocessed, the computational load is enormous. (2) Feature-level fusion [9, 10] extracts representative features from the data collected by each sensor and fuses them into a single feature vector for processing. Because a portion of the data is discarded, the accuracy is reduced. (3) Decision-level fusion [11] is based on the independent detection and classification results of each sensor. It makes an optimal global decision by integrating the recognition results of multiple sensors.

In this paper, the proposed network architecture fuses the LiDAR point cloud and RGB image to achieve high detection performance for autonomous vehicles. Firstly, the LiDAR point cloud is projected onto the BEV (bird’s-eye view) map, and the processed point cloud and RGB image are used as the network input. Inspired by the idea of the FPN (feature pyramid network) [12], a new feature extractor is proposed to generate high-resolution feature maps from the LiDAR point cloud and RGB image that are well suited to object detection. Secondly, the high-resolution feature maps are fused to generate reliable 3D vehicle proposals: ROI (region of interest) pooling [3] is applied to each feature map to obtain equal-length feature vectors, and the pooled feature maps are fused using an element-wise mean operation [13]. Thirdly, 3D box regression is performed to predict the extent and orientation of vehicles in 3D space. The architecture of the proposed fusion method is shown in Fig. 1.

Fig. 1

Architecture of the proposed fusion method

The contributions of this research can be summarized as follows: (1) A new vehicle detection method is proposed based on the fusion of feature maps generated from the LiDAR point cloud and RGB image. (2) A new feature extractor is proposed that generates a high-resolution feature map suitable for subsequent processing. (3) A new 3D bounding box encoding is proposed to predict the extent and orientation of a vehicle.

2 3D Vehicle Detection Method Architecture

2.1 3D Point Cloud Representation

In the front view of the LiDAR point cloud, front vehicles occlude rear vehicles, which affects the detection result. Therefore, to retain the information in the LiDAR point cloud more effectively, a more compact representation is adopted by projecting the 3D point cloud onto a BEV map. The BEV map is encoded according to height and density and is represented by a 2D grid with a resolution of 0.1 m. Because a 3D grid representation requires complex and extensive computation for feature extraction and the aim is to obtain more detailed information about the detected vehicles [14], the LiDAR position is set as the center, the lateral range is set to [− 40 m, 40 m], and the forward range is set to [0, 70 m] on the BEV map to align with the image detection range. The scope of the cropped LiDAR point cloud in the BEV map is shown in Fig. 2.

Fig. 2

Scope of the cropped LiDAR point cloud in the BEV map

According to the actual physical height of a vehicle, the point cloud within [0, 2.5 m] along the Z-axis is divided into five equal-height channels. Each channel is projected onto the 2D ground grid (Z = 0) and is encoded with the maximum height of the points in each grid cell. The point-cloud density serves as the sixth channel; it refers to the number of points in each cell, and the value of each cell is normalized as follows:

$$\min \left( 1.0,\ \frac{\ln (N + 1)}{\ln 16} \right)$$
(1)

where N is the number of points in the cell.
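
To make the BEV encoding concrete, the following is a minimal NumPy sketch of the projection described above (not the authors' implementation); the grid layout, the mapping of x to rows and y to columns, and the handling of the slice boundaries are assumptions.

```python
import numpy as np

def encode_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
               z_range=(0.0, 2.5), resolution=0.1, num_slices=5):
    """Encode a LiDAR point cloud (N x 3 array of x, y, z in the LiDAR frame)
    into a BEV map with five max-height channels and one density channel."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Crop to the detection range described in Section 2.1.
    mask = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z = x[mask], y[mask], z[mask]

    rows = int((x_range[1] - x_range[0]) / resolution)   # 700 cells along x
    cols = int((y_range[1] - y_range[0]) / resolution)   # 800 cells along y
    bev = np.zeros((rows, cols, num_slices + 1), dtype=np.float32)

    r = ((x - x_range[0]) / resolution).astype(np.int32)
    c = ((y - y_range[0]) / resolution).astype(np.int32)

    # Height channels: maximum point height within each of the five equal slices.
    slice_height = (z_range[1] - z_range[0]) / num_slices
    s = np.minimum((z / slice_height).astype(np.int32), num_slices - 1)
    np.maximum.at(bev, (r, c, s), z)

    # Density channel, normalized as min(1.0, ln(N + 1) / ln(16)) per Eq. (1).
    counts = np.zeros((rows, cols), dtype=np.float32)
    np.add.at(counts, (r, c), 1.0)
    bev[:, :, num_slices] = np.minimum(1.0, np.log(counts + 1.0) / np.log(16.0))
    return bev
```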

2.2 Feature Map Generation and Feature Extraction

Inspired by the FPN, which has become a key component of 2D object detection, features are extracted from the BEV map and the RGB image, respectively, to recognize and locate vehicles. Low-level feature maps of an image carry weak semantic information but accurate physical (spatial) information, whereas high-level feature maps carry rich semantic information but insufficient physical information. Therefore, to make full use of the information in the original low-level feature maps, a \(1 \times 1\) convolution is used to fuse the low-level features with the high-level features, so that feature maps at all scales have rich semantic and physical information and the final feature map is suitable for subsequent processing.

The feature extractors are based on the VGG-16 architecture [15]. Assuming that the input sizes of both the RGB image and the BEV map are \(H \times W \times D\), the first four convolution layers of the VGG-16 network are used for down-sampling, which produces a feature map eight times smaller than the corresponding input. Hence, the output size of this feature map is \(\frac{H}{8} \times \frac{W}{8} \times 256\). A \(1 \times 1\) convolution layer is used to reduce the number of channels. The up-sampled map is then merged with the corresponding down-sampled map by element-wise addition [16]. Finally, a \(3 \times 3\) convolution layer is applied to each merged map to generate the final feature map and reduce the aliasing effect. A feature map with high resolution and rich semantic information is thereby obtained, and its size is \(\frac{H}{2} \times \frac{W}{2} \times 256\). The feature extraction framework is shown in Fig. 3.

Fig. 3

Feature extraction framework
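
As an illustration of the feature extraction framework, the following PyTorch sketch follows the description above: a truncated bottom-up path, \(1 \times 1\) lateral convolutions, element-wise addition with up-sampled maps, and a \(3 \times 3\) smoothing convolution. The channel widths and the exact truncation point of VGG-16 are assumptions, not the authors' configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidFeatureExtractor(nn.Module):
    """FPN-style extractor sketch: bottom-up down-sampling to 1/8 resolution,
    then a top-down path that merges up-sampled maps with lateral features."""

    def __init__(self, in_channels=3):
        super().__init__()
        self.stage1 = self._block(in_channels, 64)    # output at H/2
        self.stage2 = self._block(64, 128)            # output at H/4
        self.stage3 = self._block(128, 256)           # output at H/8
        # 1x1 lateral convolutions align the channel dimension to 256.
        self.lat1 = nn.Conv2d(64, 256, kernel_size=1)
        self.lat2 = nn.Conv2d(128, 256, kernel_size=1)
        # 3x3 convolution to reduce the aliasing effect after merging.
        self.smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)

    @staticmethod
    def _block(cin, cout):
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2))

    def forward(self, x):
        c1 = self.stage1(x)        # H/2 x W/2 x 64
        c2 = self.stage2(c1)       # H/4 x W/4 x 128
        c3 = self.stage3(c2)       # H/8 x W/8 x 256
        # Top-down path: up-sample and merge by element-wise addition.
        p2 = self.lat2(c2) + F.interpolate(c3, scale_factor=2, mode='nearest')
        p1 = self.lat1(c1) + F.interpolate(p2, scale_factor=2, mode='nearest')
        return self.smooth(p1)     # H/2 x W/2 x 256
```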

2.3 3D Proposal Network Design

Inspired by the idea of the RPN (region proposal network), an important component of the Faster R-CNN [3] network for 2D object detection, a 3D proposal network is designed to generate 3D proposals for predicting vehicle orientation and extent. 3D box proposals are generated from a set of 3D anchor boxes that cover most of the vehicles in 3D space. Each 3D anchor box is parameterized by (x, y, z, l, w, h), where (x, y, z) denotes the center of the anchor box and (l, w, h) represents its size. The (l, w) of the anchor box takes values of (3.8 m, 1.6 m) and (1.0 m, 0.6 m), and the height h is fixed to 1.63 m. By rotating each 3D anchor by 90° and sampling the anchor boxes at intervals of 0.5 m on the BEV map, a total of 44,800 anchors are generated. Because the LiDAR point cloud is sparse, most of the 3D anchor boxes are empty. The empty anchors are removed to reduce the computational load, and the final number of anchors is kept between 8000 and 15,000. Because the BEV and RGB image feature maps have different resolutions, ROI pooling is applied to each view to obtain feature vectors of the same length. Each generated 3D anchor box is projected onto the BEV and RGB image feature maps, and the outputs of the ROI pooling are fused using an element-wise mean operation.
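
A simplified sketch of the anchor generation and empty-anchor removal is shown below. Assumed details include the z placement of the anchors and the use of the BEV density channel as an occupancy map; the exact enumeration that yields the 44,800 anchors reported above may differ.

```python
import numpy as np

def generate_anchors(x_range=(0.0, 70.0), y_range=(-40.0, 40.0), stride=0.5,
                     sizes=((3.8, 1.6), (1.0, 0.6)), height=1.63, z_center=0.0):
    """Enumerate 3D anchors (x, y, z, l, w, h): two (l, w) templates, each at
    0 and 90 degrees, sampled every 0.5 m over the BEV detection range."""
    anchors = []
    for x in np.arange(x_range[0], x_range[1], stride):
        for y in np.arange(y_range[0], y_range[1], stride):
            for l, w in sizes:
                anchors.append((x, y, z_center, l, w, height))   # 0 degrees
                anchors.append((x, y, z_center, w, l, height))   # 90 degrees
    return np.asarray(anchors, dtype=np.float32)

def remove_empty_anchors(anchors, bev_density, resolution=0.1, y_offset=40.0):
    """Discard anchors whose footprint contains no LiDAR points, using the
    BEV density channel as an occupancy map."""
    keep = []
    for i, (x, y, _, l, w, _) in enumerate(anchors):
        r0 = max(int((x - l / 2) / resolution), 0)
        r1 = int((x + l / 2) / resolution) + 1
        c0 = max(int((y - w / 2 + y_offset) / resolution), 0)
        c1 = int((y + w / 2 + y_offset) / resolution) + 1
        if bev_density[r0:r1, c0:c1].sum() > 0:
            keep.append(i)
    return anchors[keep]
```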

A binary label is assigned to each anchor to indicate whether it is an object or background. The IOU (intersection over union) [17] between each anchor and the ground-truth bounding boxes is computed, and a positive label is assigned to two types of anchors: in the first type, the IOU between the anchor and a ground-truth bounding box is greater than 0.5; in the second type, the anchor has the highest IOU with a ground-truth bounding box among the anchors generated at the same point, even though this IOU is less than 0.5. A single ground-truth bounding box can therefore assign positive labels to multiple anchors. A negative label is assigned to an anchor whose IOU is less than 0.3 for all ground-truth bounding boxes. Anchors that are neither positive nor negative do not contribute to the training objective and are ignored in subsequent processing. The fused feature map is used to regress the anchor boxes, and 3D box regression generates the 3D proposals. A multitask loss is used to perform vehicle/background classification and 3D box regression simultaneously: the Smooth \(L_{1}\) loss is used for the 3D box regression, and the cross-entropy loss is used to determine whether an anchor is positive or negative. The total loss L is as follows:

$$\left\{ \begin{aligned} & L_{\text{cls}} \left( p_{i} ,p_{i}^{*} \right) = - \left[ p_{i}^{*} \ln \left( p_{i} \right) + \left( 1 - p_{i}^{*} \right)\ln \left( 1 - p_{i} \right) \right] \\ & L_{\text{reg}} \left( \boldsymbol{t}_{i} ,\boldsymbol{t}_{i}^{*} \right) = \left\{ {\begin{array}{*{20}c} {0.5\left( \boldsymbol{t}_{i} - \boldsymbol{t}_{i}^{*} \right)^{2} ,} & {\left| \boldsymbol{t}_{i} - \boldsymbol{t}_{i}^{*} \right| \le 1} \\ {\left| \boldsymbol{t}_{i} - \boldsymbol{t}_{i}^{*} \right| - 0.5,} & {\left| \boldsymbol{t}_{i} - \boldsymbol{t}_{i}^{*} \right| > 1} \\ \end{array} } \right. \\ & L = \frac{1}{N}\sum\limits_{i = 1}^{N} L_{\text{cls}} \left( p_{i} ,p_{i}^{*} \right) + \frac{1}{N}\sum\limits_{i = 1}^{N} p_{i}^{*} L_{\text{reg}} \left( \boldsymbol{t}_{i} ,\boldsymbol{t}_{i}^{*} \right) \\ \end{aligned} \right.$$
(2)

where \(L_{\text{cls}}\) is the cross-entropy loss, \(L_{\text{reg}}\) is the Smooth \(L_{1}\) loss, \(p_{i}\) is the predicted probability that anchor i is an object, the ground-truth label \(p_{i}^{*}\) is 1 if the anchor is positive and 0 if the anchor is negative, N is the number of anchors, \(\boldsymbol{t}_{i} = \left( t_{x} ,t_{y} ,t_{z} ,t_{l} ,t_{w} ,t_{h} \right)\) is the offset of the predicted box relative to the 3D anchor box, and \(\boldsymbol{t}_{i}^{*} = \left( t_{x}^{*} ,t_{y}^{*} ,t_{z}^{*} ,t_{l}^{*} ,t_{w}^{*} ,t_{h}^{*} \right)\) is the offset of the ground-truth box relative to the 3D anchor box. The offsets are calculated as follows:

$$\left\{ \begin{aligned} t_{x} = & \frac{x_{\text{p}} - x_{\text{a}} }{d_{\text{a}} };\; t_{y} = \frac{y_{\text{p}} - y_{\text{a}} }{d_{\text{a}} };\; t_{z} = \frac{z_{\text{p}} - z_{\text{a}} }{h_{\text{a}} } \\ t_{l} = & \lg \left( \frac{l_{\text{p}} }{l_{\text{a}} } \right);\; t_{w} = \lg \left( \frac{w_{\text{p}} }{w_{\text{a}} } \right);\; t_{h} = \lg \left( \frac{h_{\text{p}} }{h_{\text{a}} } \right) \\ t_{x}^{*} = & \frac{x_{\text{g}} - x_{\text{a}} }{d_{\text{a}} };\; t_{y}^{*} = \frac{y_{\text{g}} - y_{\text{a}} }{d_{\text{a}} };\; t_{z}^{*} = \frac{z_{\text{g}} - z_{\text{a}} }{h_{\text{a}} } \\ t_{l}^{*} = & \lg \left( \frac{l_{\text{g}} }{l_{\text{a}} } \right);\; t_{w}^{*} = \lg \left( \frac{w_{\text{g}} }{w_{\text{a}} } \right);\; t_{h}^{*} = \lg \left( \frac{h_{\text{g}} }{h_{\text{a}} } \right) \\ d_{\text{a}} = & \sqrt {l_{\text{a}}^{2} + w_{\text{a}}^{2} } \\ \end{aligned} \right.$$
(3)

where (\(x_{\text{g}} ,y_{\text{g}} ,z_{\text{g}} ,l_{\text{g}} ,w_{\text{g}} ,h_{\text{g}}\)) is the ground-truth box, (\(x_{\text{a}} ,y_{\text{a}} ,z_{\text{a}} ,l_{\text{a}} ,w_{\text{a}} ,h_{\text{a}}\)) is the 3D anchor box, (\(x_{\text{p}} ,y_{\text{p}} ,z_{\text{p}} ,l_{\text{p}} ,w_{\text{p}} ,h_{\text{p}}\)) is the predicted box, and \(d_{\text{a}}\) is the diagonal length of the base of the 3D anchor box. NMS (non-maximum suppression) with a threshold of 0.8 is applied on the BEV map to retain the top-1000 proposals during training; only the top-300 proposals are kept during testing. The 3D proposals are projected onto the RGB image, as shown in Fig. 4.

Fig. 4

Projection of the 3D proposals onto the RGB image
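
As a concrete reference for the target encoding of Eq. (3), a small helper is sketched below; following the notation in the equation, lg is taken as the base-10 logarithm.

```python
import numpy as np

def encode_box_offsets(box, anchor):
    """Offsets of a 3D box (x, y, z, l, w, h) relative to a 3D anchor box,
    as in Eq. (3): centers are normalized by the anchor base diagonal (or the
    anchor height for z), and sizes by the log-ratio to the anchor size."""
    x, y, z, l, w, h = box
    xa, ya, za, la, wa, ha = anchor
    da = np.sqrt(la ** 2 + wa ** 2)           # diagonal of the anchor base
    return np.array([(x - xa) / da, (y - ya) / da, (z - za) / ha,
                     np.log10(l / la), np.log10(w / wa), np.log10(h / ha)],
                    dtype=np.float32)

# t  (predicted offsets)    = encode_box_offsets(predicted_box, anchor)
# t* (ground-truth offsets) = encode_box_offsets(ground_truth_box, anchor)
```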

2.4 Region Proposal Fusion

Information fusion integrates and optimizes a variety of information and retains the useful parts according to the inherent connections and rules among the data. In this paper, the feature-level fusion method is employed.

A fusion network is designed to effectively combine features from the BEV and RGB image feature maps and to jointly perform oriented 3D box regression. Because the BEV and RGB image feature maps have different resolutions, ROI pooling is applied to the feature map of each box proposal to resize it to \(7 \times 7\), yielding equal-length feature vectors; the pooled feature maps are then fused using an element-wise mean operation. The fused features are as follows:

$$F_{L} = H_{L} \left( H_{L - 1} \left( \ldots H_{1} \left( \frac{F_{\text{BEV}} + F_{\text{RGB}}}{2} \right) \right) \right)$$
(4)

where \(F_{L}\) is the fused feature, \(H_{L}\) is the feature transformation function of layer L, \(F_{\text{BEV}}\) is the feature of the BEV map, and \(F_{\text{RGB}}\) is the feature of the RGB image map.
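
A minimal sketch of this fusion step using torchvision's ROI pooling is given below; the box format and the assumption that the projected proposals are already expressed in feature-map coordinates are simplifications.

```python
from torchvision.ops import roi_pool

def fuse_proposal_features(bev_map, rgb_map, bev_rois, rgb_rois,
                           output_size=(7, 7)):
    """ROI-pool each proposal from the BEV and RGB feature maps to 7x7 and
    fuse by an element-wise mean; Eq. (4) then passes the fused tensor through
    the layers H_1..H_L of the fusion network."""
    # bev_rois / rgb_rois: (K, 5) tensors of (batch_index, x1, y1, x2, y2)
    # obtained by projecting each 3D proposal onto the two feature maps.
    f_bev = roi_pool(bev_map, bev_rois, output_size)   # (K, C, 7, 7)
    f_rgb = roi_pool(rgb_map, rgb_rois, output_size)   # (K, C, 7, 7)
    return (f_bev + f_rgb) / 2.0                       # element-wise mean
```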

2.5 3D Box Regression

Given the fused features from the fusion network, a further regression operation is required to determine the orientation and classification of each proposal. Hence, oriented 3D boxes are regressed from the 3D proposals. In particular, the bounding box is encoded by its center, length, width, and two heights, so the regression targets are encoded as \(\left( {\Delta x,\Delta y,\Delta d_{x} ,\Delta d_{y} ,\Delta h_{1} ,\Delta h_{2} } \right)\). The encoded bounding box is shown in Fig. 5. Compared to the 8-corner box encoding proposed in [14], only six parameters are needed to represent the oriented 3D box, so the proposed encoding reduces the box representation from an overparameterized 24-dimensional vector to a six-dimensional one, which removes redundancy while keeping the physical constraints. A multitask loss is again used: the predicted 3D box is classified as vehicle or background with the cross-entropy loss, and the 3D box regression is performed with the Smooth \(L_{1}\) loss.

Fig. 5

Encoded 3D bounding box
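
Because the exact parameterization of \(\left( {\Delta x,\Delta y,\Delta d_{x} ,\Delta d_{y} ,\Delta h_{1} ,\Delta h_{2} } \right)\) is not spelled out in the text, the following is only a hypothetical illustration of a six-parameter regression target, using plain differences between a proposal and its matched ground-truth box.

```python
import numpy as np

def six_param_target(proposal, ground_truth):
    """Hypothetical six-parameter target (dx, dy, d_dx, d_dy, dh1, dh2):
    offsets of the BEV center, the BEV footprint size, and the two height
    boundaries of the box. The plain-difference form is an assumption, not
    the paper's exact definition."""
    x, y, dx, dy, h1, h2 = proposal
    xg, yg, dxg, dyg, h1g, h2g = ground_truth
    return np.array([xg - x, yg - y, dxg - dx, dyg - dy, h1g - h1, h2g - h2],
                    dtype=np.float32)
```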

2.6 KITTI Dataset

In this paper, the proposed network is trained and verified on the KITTI dataset [18]. The collected scenes are diverse, ranging from highways to countryside roads. The dataset contains eight obstacle types: cars, vans, trucks, pedestrians, pedestrians (sitting), cyclists, trams, and others. All cars, vans, and trucks are treated as vehicles in this paper.

The dataset mainly consists of three parts: (1) RGB images collected by a camera; (2) LiDAR point clouds collected by a Velodyne HDL-64E laser scanner, which contain the coordinates (x, y, z) in the LiDAR coordinate system and the reflection intensity of each point (the scanner collects about 1.3 million points per second); and (3) the calibration files, which describe the relationship between the camera coordinate system and the LiDAR coordinate system. The dataset consists of 7481 training samples and 7518 test samples.
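
For reference, the relationship encoded in the calibration files can be applied as in the following NumPy sketch, which projects LiDAR points into the image plane; the matrix names (Tr_velo_to_cam, R0_rect, P2) follow the KITTI calibration file format.

```python
import numpy as np

def project_lidar_to_image(points, Tr_velo_to_cam, R0_rect, P2):
    """Project LiDAR points (N x 3) into the image plane using the KITTI
    calibration convention x = P2 * R0_rect * Tr_velo_to_cam * X, where
    Tr_velo_to_cam is 3x4, R0_rect is 3x3, and P2 is 3x4."""
    # Promote the 3x4 / 3x3 matrices to 4x4 so they can be chained.
    Tr = np.vstack([Tr_velo_to_cam, [0, 0, 0, 1]])
    R0 = np.eye(4)
    R0[:3, :3] = R0_rect

    pts = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4) homogeneous
    cam = R0 @ Tr @ pts.T                                      # rectified camera frame
    img = P2 @ cam                                             # (3, N) homogeneous pixels

    # Keep only points in front of the camera and normalize by depth.
    in_front = img[2, :] > 0.1
    uv = img[:2, in_front] / img[2, in_front]
    return uv.T                                                # (M, 2) pixel coordinates
```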

2.7 Training

This work relies on the known coordinate relationship between the LiDAR point cloud and the RGB image. The 7481 training samples are split into two parts [4], resulting in 3712 samples for training and 3769 samples for validation.

The main parameters of the experimental platform are as follows: the processor is an Intel Core i5-8600K CPU @ 3.60 GHz, the memory is 64 GB, and the graphics card is an NVIDIA GeForce GTX 1080. The network is trained with the Adam optimizer for 100,000 global steps at an initial learning rate \(L_{I}\) of 0.001. The learning rate \(L_{D}\) decays exponentially every 20,000 steps with a decay rate of 0.8 and is expressed as follows:

$$L_{D} = L_{I} \times {\text{decay}}\_{\text{rate}}^{{\left( {\frac{{{\text{new}}\_{\text{step}}}}{20000}} \right)}}$$
(5)

where new_step is the current training step, which increases until it reaches the total number of global steps.
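
A small sketch of the schedule in Eq. (5) is given below; whether the exponent is truncated to an integer (a staircase decay every 20,000 steps) or used as a raw ratio is left as an option, since the text supports either reading.

```python
def decayed_learning_rate(step, initial_lr=0.001, decay_rate=0.8,
                          decay_steps=20000, staircase=True):
    """Learning-rate schedule of Eq. (5). With staircase=True the exponent is
    truncated to an integer, matching a reduction every 20,000 steps; the
    continuous variant uses the raw ratio new_step / 20000."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent

# e.g. decayed_learning_rate(60000) == 0.001 * 0.8 ** 3 == 0.000512
```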

3 Experimental Results

The KITTI dataset is used to evaluate the detection performance of the proposed method. The test results are evaluated at three difficulty levels: easy, moderate, and difficult. A sample is easy if the vehicles are fully visible and truncated by at most 15%; moderate if the vehicles are partly occluded and truncated by at most 30%; and difficult if the vehicles are difficult to see and truncated by at most 50%.

The proposed method is compared with several top-performing algorithms: a LiDAR-based approach (RT3D [19]), an RGB image-based approach (Stereo R-CNN [20]), and an approach based on the fusion of the LiDAR point cloud and RGB image (A3DODWTD [21]). The runtime and average precision of the different methods are analyzed, where the average precision is defined as follows:

$$\left\{ \begin{aligned} P = & \frac{TP}{TP + FP} \\ R = & \frac{TP}{TP + FN} \\ AP = & \frac{1}{11}\sum\limits_{r \in \left\{ 0,0.1, \ldots ,1 \right\}} \max\limits_{R \ge r} P\left( R \right) \\ \end{aligned} \right.$$
(6)

where P is the precision, R is the recall, AP is the average precision, TP is the number of 3D boxes correctly predicted as vehicles, FP is the number of 3D boxes in which background is predicted as a vehicle, and FN is the number of vehicles predicted as background. The results are compared in Table 1.

Table 1 Comparison results of several top-performing methods [19,20,21]
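
For clarity, the eleven-point interpolated AP of Eq. (6) can be computed as in the following sketch, given precision and recall values measured over the confidence-ranked detections.

```python
import numpy as np

def eleven_point_ap(precisions, recalls):
    """Eleven-point interpolated average precision of Eq. (6): for each recall
    threshold r in {0, 0.1, ..., 1}, take the maximum precision attained at a
    recall of at least r, then average the eleven values."""
    precisions = np.asarray(precisions, dtype=np.float64)
    recalls = np.asarray(recalls, dtype=np.float64)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        ap += precisions[mask].max() if mask.any() else 0.0
    return ap / 11.0
```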

To verify the detection and real-time performance of the proposed vehicle detection method, two test scenarios with different levels of difficulty are selected. The detection results for a simple scene are presented in the left images of Fig. 6. In this scene, only a few cars are on the road, the light is poor, and tree shadows almost cover the vehicle on the right. The detection results for a difficult scene are shown in the right images of Fig. 6. In this scene, both sides of the road are full of vehicles that occlude each other.

Fig. 6

3D vehicle detection in an easy scene (left) and difficult scene (right): a original RGB image; b 3D detection result on the RGB image; c detection result on BEV map

3.1 Evaluation in 3D Detection

Compared to 2D vehicle detection, 3D vehicle detection is more challenging. The comparison results show that the proposed method significantly outperforms the other approaches with respect to the \(AP_{3\text{D}}\) metric. Specifically, it exceeds the fusion method A3DODWTD by 11.78%, 4.37%, and 2.88% on easy, moderate, and difficult samples, respectively.

3.2 Evaluation in BEV Detection

The evaluation results for \(AP_{\text{BEV}}\) are presented in Table 1. The proposed method consistently outperforms the compared approaches and is clearly better than image-based detection. An important reason for this gap is that vehicles occluded in the image cannot be detected by image-based methods.

CenterNet [22] is an image-based detector, and Fig. 7 compares the proposed method with CenterNet with respect to 3D detection and BEV detection.

Fig. 7

Qualitative comparison of the proposed method (left) and CenterNet (right) 3D detection results: a RGB image detection; b BEV detection

3.3 AOS (Average Orientation Similarity) Evaluation

To evaluate the performance of orientation regression, the AOS is used according to the method proposed in [23]. AOS is defined as follows:

$$\left\{ \begin{aligned} s\left( r \right) = & \frac{1}{\left| D\left( r \right) \right|}\sum\limits_{i \in D\left( r \right)} \frac{1 + \cos \Delta \theta_{i} }{2}\delta_{i} ,\quad \left\{ {\begin{array}{*{20}c} {\delta_{i} = 1,} & {\text{IOU} \ge 0.5} \\ {\delta_{i} = 0,} & {\text{IOU} < 0.5} \\ \end{array} } \right. \\ \text{AOS} = & \frac{1}{11}\sum\limits_{r \in \left\{ 0,0.1, \ldots ,1 \right\}} \max\limits_{R \ge r} s\left( R \right) \\ \end{aligned} \right.$$
(7)

where \(D(r)\) is the set of detections at recall level r, \(\Delta \theta_{i}\) is the difference between the predicted angle and the ground-truth angle of vehicle i, \(\delta_{i}\) indicates whether detection i is matched to a ground-truth box with an IOU of at least 0.5, and R is the recall.
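
A sketch of the AOS computation is given below; the matching rule (a detection counts as assigned when its IOU with a ground-truth box is at least 0.5) follows the standard KITTI definition.

```python
import numpy as np

def orientation_similarity(delta_theta, assigned):
    """Orientation similarity s(r) of Eq. (7) for the detections available at
    one recall level: delta_theta holds the angle errors, and `assigned` marks
    detections matched to a ground-truth box (IOU of at least 0.5)."""
    delta_theta = np.asarray(delta_theta, dtype=np.float64)
    assigned = np.asarray(assigned, dtype=np.float64)
    return np.mean((1.0 + np.cos(delta_theta)) / 2.0 * assigned)

def aos(similarities, recalls):
    """Eleven-point average of the orientation similarity, analogous to AP."""
    similarities = np.asarray(similarities, dtype=np.float64)
    recalls = np.asarray(recalls, dtype=np.float64)
    total = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        total += similarities[mask].max() if mask.any() else 0.0
    return total / 11.0
```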

Table 2 shows the AOS performance of the different methods reported in the KITTI online evaluation. The AOS of the proposed method outperforms that of the other methods, which further illustrates its advantage in orientation estimation.

Table 2 AOS of different methods (%)

The experimental results show that the proposed vehicle detection method achieves good detection results in scenes with different levels of difficulty. Moreover, as given in Table 1, the runtime of the proposed method is 0.12 s, so it could be used as a basis for further research in autonomous vehicles.

4 Conclusion

In this paper, a new method is proposed for 3D vehicle detection based on the fusion of data collected by LiDAR and a camera. The model takes advantage of both the LiDAR point cloud and the RGB image. The RPN is improved to obtain 3D proposals from the BEV map and RGB image feature maps. Furthermore, a fusion network is presented to fuse the information and perform 3D box regression. Experiments on the KITTI dataset verify the detection performance. The experimental results show that the proposed method for 3D vehicle detection is superior to the existing related methods based on LiDAR and camera data, and it achieves good real-time performance and reliability.

However, the feature fusion of the BEV and RGB image in the proposed method is relatively simple, which could lead to insufficient use of the feature information. Therefore, in future work, the network needs to be optimized to achieve better 3D vehicle detection results.