
1 Introduction

3D object detection is an essential component of scene perception and motion prediction in autonomous driving [2, 9]. Currently, most powerful 3D detectors rely heavily on 3D LiDAR laser scanners because they provide accurate scene locations [8, 29, 41, 46]. However, LiDAR-based systems are expensive and difficult to embed into the current vehicle shape. In comparison, monocular cameras are cheaper and more convenient, which has drawn increasing attention to them in many application scenarios [6, 26, 40]. In this paper, the scope of our research is 3D object detection from only a monocular RGB image.

Fig. 1.

Overview of the proposed method: we first predict the ordinal keypoints projected into the image space from the eight vertexes and the central point of a 3D object. We then reformulate the estimation of the 3D bounding box as the problem of minimizing an energy function built from the geometric constraints of perspective projection.

Monocular 3D object detection methods can be roughly divided into two categories by the type of training data. One imposes complex features, such as instance segmentation, category-specific shape priors, and even depth maps, to select the best proposals in a multi-stage fusion module [6, 7, 40]. These features require additional annotation work to train stand-alone networks, which consumes plenty of computing resources in the training and inference stages. The other employs only the 2D bounding box and the properties of the 3D object as supervision [3, 20, 33, 42]. In this case, the most straightforward way is to build a deep regression network that directly predicts the 3D information of the object, which causes a performance bottleneck due to the large search space. To address this challenge, recent works apply geometric constraints from the 3D box vertexes to the 2D box edges to refine or directly predict object parameters [3, 20, 23, 26, 28]. However, the four edges of a 2D bounding box provide only four constraints on recovering a 3D bounding box, while each vertex of the 3D bounding box might correspond to any edge of the 2D box, so \(8^4=4096\) configurations must be evaluated to obtain one result [26]. Meanwhile, the strong reliance on the 2D box causes a sharp decline in 3D detection performance when the 2D detector's predictions have even a slight error. Therefore, most of these methods take advantage of two-stage detectors [10, 11, 32] to ensure the accuracy of the 2D box, which limits the upper bound of the detection speed.

Table 1. Comparison of real-time capability and additional-data requirements of different image-based detection approaches.

In this paper, we propose an efficient and accurate one-stage monocular 3D detection framework that is tailored for 3D detection without relying on extra annotations, category-specific 3D shape priors, or depth maps. The framework can be divided into two main parts, as shown in Fig. 1. First, a one-stage fully convolutional architecture efficiently predicts nine 2D keypoints, which are the projections of the eight vertexes and the center of the 3D bounding box. These nine keypoints provide 18 geometric constraints on the 3D bounding box. Inspired by CenterNet [45], we model the relationship between the eight vertexes and the central point to solve the keypoint grouping and vertex ordering problems. SIFT, SURF, and other traditional keypoint detection methods [1, 24] compute an image pyramid to achieve scale invariance. A similar strategy is used by CenterNet as a post-processing step to further improve detection accuracy, which slows inference. Note that the Feature Pyramid Network (FPN) [21] used in 2D object detection is not applicable to keypoint detection, because adjacent keypoints may overlap at small scales. We propose a novel multi-scale pyramid for keypoint detection that generates a scale-space response; the final keypoint activation map is obtained by soft-weighting the pyramid. Given the nine projected points, the next step is to minimize the reprojection error over the perspective projection of the 3D points, parameterized by the location, dimension, and orientation of the object. We formulate the reprojection error as multivariate equations in \(\mathfrak {se}_3\) space, which yields accurate and efficient detection results. We also discuss the effect of different priors, such as dimension, orientation, and distance, predicted in parallel by our keypoint detection network. The prerequisite for obtaining this information is that it must not add too much computation, so as not to affect the final detection speed. We model these priors and the reprojection error term in an overall energy function to further improve the 3D estimation.

To summarize, our main contributions are the following:

  • We formulate monocular 3D detection as a keypoint detection problem and combine it with geometric constraints to generate the properties of 3D objects more efficiently and accurately.

  • We propose a novel one-stage and multi-scale network for 3D keypoint detection, which provides accurate projected points for objects at multiple scales.

  • We propose an overall energy function that can jointly optimize the prior and 3D object information.

  • Evaluated on the KITTI benchmark, ours is the first real-time 3D detection method that uses only images, and it achieves better accuracy than competing methods at the same running time.

2 Related Work

Extra Data or Network for Image-based 3D Object Detection. In recent years, many studies have developed image-based 3D detection because camera devices are more convenient and much cheaper. To compensate for the missing depth information in image-based detection, most previous approaches rely heavily on stand-alone networks or additional labeled data, such as instance segmentation, stereo, wire-frame models, CAD priors, and depth, as shown in Table 1. Among them, monocular 3D detection is the more challenging task due to the difficulty of obtaining reliable 3D information from a single image. One of the first examples [6] enumerates a multitude of 3D proposals from a pre-defined space where objects may appear according to geometric heuristics. It then uses other complex priors, such as shape, instance segmentation, and contextual features, to filter out poor proposals and score the remainder with a classifier. To make up for the lack of depth, [40] embeds a pre-trained stand-alone module to estimate disparity. The disparity map is concatenated with the front-view representation to help the 2D proposal network, and 3D detection is boosted by fusing the features extracted after RoI pooling with the point cloud. As a follow-up, [25] combines a 2D detector and a monocular depth estimation model to obtain the 2D box and the corresponding point cloud. The final 3D box is obtained by PointNet [30] regression after aggregating the image features and 3D point information through an attention mechanism, which achieves the best performance among monocular methods. Intuitively, these methods certainly increase detection accuracy, but the additional networks and annotated data lead to more computation and labor-intensive work.

Image-only in Monocular 3D Object Detection. Recent works have tried to fully explore the potential of RGB images for 3D detection. Most of them combine geometric constraints and 2D detectors to explicitly describe the 3D information of the object. [26] uses a CNN to estimate the dimension and orientation from features extracted within the 2D box and then obtains the location of the object by using the geometric constraints of the perspective relationship between the 3D points and the 2D box edges. This contribution is followed by most image-based detection methods, either in a refinement step or for direct computation of the 3D object [3, 20]. All this constraint tells us is that certain 3D points project onto the 2D edges; the correspondence and the exact projection locations are not known. Therefore, \(8^4=4096\) configurations must be enumerated exhaustively to determine the final correspondence, and the constraint still provides only four equations, which is not sufficient for the full 3D representation with nine parameters. This leads to the need to estimate other prior information. Moreover, possible inaccuracies in the 2D bounding boxes may result in a grossly inaccurate solution given such a small number of constraints. Therefore, most of these methods obtain a more accurate 2D box through a two-stage detector, which makes real-time speed difficult to achieve.

Keypoints in Monocular 3D Object Detection. It is believed that the detection accuracy of occluded and truncated objects can be improved by deducing complete shapes from vehicle keypoints [5, 27, 44]. These works represent regularly shaped vehicles with a wire-frame template obtained from a large number of CAD models. To train the keypoint detection network, they need to re-label the data set and even use depth maps to enhance the detection capability. [13] is the work most related to ours; it also uses a wire-frame model as prior information and jointly optimizes the 2D box, 2D keypoints, 3D orientation, scale hypotheses, shape hypotheses, and depth with four different networks, which limits its running time. In contrast to prior work, we reformulate 3D detection as a coarse keypoint detection task. Instead of predicting the 3D box from an off-the-shelf 2D detector or other data generators, we build a network to predict the nine 2D keypoints projected by the vertexes and center of the 3D bounding box and minimize the reprojection error to find the optimal result.

3 Proposed Method

In this section, we first describe the overall architecture for keypoint detection and prior property prediction. We then detail how to estimate the 3D bounding box of the object by maintaining 2D-3D consistency.

Fig. 2.

An overview of the proposed keypoint detection architecture: it takes only an RGB image as input and outputs the main-center heatmap, vertex heatmaps, and vertex coordinates as the base module for estimating the 3D bounding box. It can also predict other optional priors to further improve the performance of 3D detection.

3.1 Keypoint Detection Network

Our keypoint detection network takes only an RGB image as input and simultaneously generates 2D-related information, such as the perspective points and 2D size, and 3D-related information, such as dimension, orientation, and distance. As shown in Fig. 2, it consists of three components: the backbone, the keypoint feature pyramid, and the detection head. The main architecture adopts a one-stage strategy that shares a similar layout with anchor-free 2D object detectors [15, 18, 36, 45], which allows a fast detection speed. Details of the network are given below.

Backbone. For the trade-off between speed and accuracy, we use two different structures as our backbones: ResNet-18 [12] and DLA-34 [43]. All models take a single RGB image \(I \in \mathbb {R}^{W \times H \times 3}\) and downsample the input by a factor \(S=4\). ResNet-18 and DLA-34 are built for image classification, with a maximal downsample factor of \(\times 32\). We therefore upsample the bottleneck three times with bilinear interpolation and \(1 \times 1\) convolutional layers. Before each upsampling layer, we concatenate the corresponding low-level feature map and add one \(1\times 1\) convolutional layer for channel reduction. After the three upsampling layers, the channels are 256, 128, and 64, respectively.
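As a concrete illustration, the PyTorch sketch below mirrors the decoder just described: three bilinear upsampling steps from the ×32 bottleneck back to stride 4, each followed by a \(1\times 1\) convolution, with the corresponding low-level feature map concatenated after a \(1\times 1\) channel-reduction layer. The module name, the assumed ResNet-18 channel widths, and the exact fusion order are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampleHead(nn.Module):
    """Hypothetical sketch of the decoder described above: three bilinear
    upsampling steps with 1x1 convs and skip concatenation from the backbone.
    Output channels follow the text (256, 128, 64); skip/bottleneck widths
    are assumed for ResNet-18 (strides 16, 8, 4 -> 256, 128, 64 channels)."""

    def __init__(self, bottleneck_ch=512, skip_chs=(256, 128, 64), out_chs=(256, 128, 64)):
        super().__init__()
        self.reduce = nn.ModuleList()   # 1x1 convs that shrink the skip features
        self.fuse = nn.ModuleList()     # 1x1 convs applied after concatenation
        in_ch = bottleneck_ch
        for skip_ch, out_ch in zip(skip_chs, out_chs):
            self.reduce.append(nn.Conv2d(skip_ch, out_ch, kernel_size=1))
            self.fuse.append(nn.Conv2d(in_ch + out_ch, out_ch, kernel_size=1))
            in_ch = out_ch

    def forward(self, bottleneck, skips):
        # skips: backbone feature maps at strides 16, 8, 4 (largest last)
        x = bottleneck
        for reduce, fuse, skip in zip(self.reduce, self.fuse, skips):
            x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
            x = torch.cat([x, reduce(skip)], dim=1)   # fuse with the low-level feature
            x = fuse(x)
        return x  # stride-4 feature map with 64 channels
```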

Keypoint Feature Pyramid. Keypoints in the image have no notion of size. Therefore, keypoint detection is not suited to the Feature Pyramid Network (FPN) [21], which detects multi-scale 2D boxes in different pyramid layers. Instead, we propose the Keypoint Feature Pyramid Network (KFPN) to detect scale-invariant keypoints in point-wise space, as shown in Fig. 3. Assuming we have F scale feature maps, we first resize each scale \(f, 1<f<F\) back to the size of the maximal scale, which yields the feature maps \(\hat{f}_{1<f<F}\). Then we generate a soft weight by a softmax operation to denote the importance of each scale. The final scale-space score map \(S_{score}\) is obtained by a linearly weighted sum. In detail, it is defined as:

$$\begin{aligned} \begin{array}{lr} \begin{aligned} S_{score}= \sum \limits _f{\hat{f} {\odot } softmax(\hat{f})} \end{aligned} \end{array} \end{aligned}$$
(1)

where \(\odot \) denotes the element-wise product and the softmax is computed across scales.
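For clarity, a minimal PyTorch sketch of Eq. (1) follows; the function name and tensor layout are assumptions, and the softmax is taken across the scale dimension as described above.

```python
import torch
import torch.nn.functional as F

def kfpn_score(feature_maps):
    """Sketch of Eq. (1): resize every pyramid level to the largest scale,
    softmax across the scale dimension to obtain per-pixel soft weights,
    then sum the weighted maps. `feature_maps` is a list of tensors with
    shape (B, C, H_f, W_f); feature_maps[0] is assumed to be the largest."""
    target = feature_maps[0].shape[2:]          # spatial size of the maximal scale
    resized = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
               for f in feature_maps]
    stacked = torch.stack(resized, dim=0)       # (F, B, C, H, W)
    weights = torch.softmax(stacked, dim=0)     # soft weight per scale
    return (stacked * weights).sum(dim=0)       # scale-space score map S_score
```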

Fig. 3.

Illustration of our keypoint feature pyramid network (KFPN).

Detection Head. The detection head comprises three fundamental components and six optional components, which can be selected arbitrarily to boost the accuracy of 3D detection at little computational cost. Inspired by CenterNet [45], we take a keypoint as the main center for connecting all features. Since the projected 3D center of the object may lie outside the image boundary in the case of truncation, the center of the 2D box is the more appropriate choice. The heatmap is defined as \(M \in [0,1]^{\frac{H}{S}\times {\frac{W}{S} \times C}}\), where C is the number of object categories. Another fundamental component is the heatmap \(V \in [0,1]^{\frac{H}{S}\times {\frac{W}{S} \times 9}}\) of the nine keypoints projected from the vertexes and center of the 3D bounding box. For keypoint association within one object, we also regress a local offset \(V_{c} \in R^{\frac{H}{S}\times {\frac{W}{S} \times 18}}\) from the main center as an indication. The keypoints of V closest to the coordinates given by \(V_{c}\) are taken as the group belonging to one object, as sketched below.
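The grouping rule can be sketched as follows; the function name and tensor layouts are illustrative assumptions rather than the exact implementation.

```python
import torch

def group_keypoints(center_xy, vc_offsets, keypoint_peaks):
    """Illustrative sketch of the grouping rule above. `center_xy` is the
    (x, y) location of one detected main center on the stride-S map,
    `vc_offsets` holds the nine regressed keypoint coordinates relative to
    that center (shape (9, 2)), and `keypoint_peaks[j]` is an (N_j, 2)
    tensor of peak locations extracted from channel j of the heatmap V."""
    grouped = []
    coarse = center_xy.float().unsqueeze(0) + vc_offsets.float()  # (9, 2) coarse keypoints
    for j, peaks in enumerate(keypoint_peaks):
        if peaks.numel() == 0:
            grouped.append(coarse[j])            # no detected peak: keep the regressed point
            continue
        d = torch.cdist(coarse[j:j + 1], peaks.float())  # distances to the channel-j peaks
        grouped.append(peaks[d.argmin()].float())        # snap to the nearest detected peak
    return torch.stack(grouped)                          # (9, 2) keypoints of one object
```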

Although the 18 constraints provided by the nine keypoints are enough to recover the 3D information of the object, additional priors provide more constraints and further improve the detection performance. We therefore offer a number of options to meet different needs for accuracy and speed, summarized in the sketch below. The center offset \(M_{os} \in R^{\frac{H}{S}\times {\frac{W}{S} \times 2}}\) and vertex offset \(V_{os} \in R^{\frac{H}{S}\times {\frac{W}{S} \times 2}}\) are the discretization errors of the keypoints in the downsampled heatmaps. The dimensions \(D \in \mathbb {R}^{\frac{H}{S}\times {\frac{W}{S} \times 3}}\) of 3D objects have a small variance, which makes them easy to predict. The rotation \(R(\theta )\) of an object is parameterized only by the orientation \(\theta \) (yaw) in the autonomous driving scene. We employ the Multi-Bin based method [26] to regress the local orientation: in each bin we classify the bin probability and regress the cosine and sine offsets of the local angle, which generates an orientation feature map \(O \in \mathbb {R}^{\frac{H}{S}\times {\frac{W}{S} \times 8}}\) with two bins. We also regress the depth \(Z \in \mathbb {R}^{\frac{H}{S}\times {\frac{W}{S} \times 1}}\) of the 3D box center, which can be used as the initialization value to speed up the solution in Sect. 3.2.
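As a compact summary, the sketch below lists the output branches and channel counts described above; the per-branch 3×3/1×1 convolution structure and the default number of classes are assumptions for illustration, and the optional 2D-size branch mentioned in the overview is omitted.

```python
import torch.nn as nn

def build_detection_heads(in_ch=64, num_classes=3):
    """Sketch of the detection-head branches and their channel counts as
    given above (feature stride S = 4). The internal layer structure and
    num_classes default are assumptions."""
    channels = {
        "maincenter": num_classes,  # M: per-class main-center heatmap
        "keypoints": 9,             # V: heatmap of the nine projected points
        "kp_coords": 18,            # V_c: keypoint coordinates regressed from the center
        "center_offset": 2,         # M_os: discretization offset of the main center
        "kp_offset": 2,             # V_os: discretization offset of the keypoints
        "dimension": 3,             # D: object height, width, length
        "orientation": 8,           # O: Multi-Bin with two bins (prob + sin/cos per bin)
        "depth": 1,                 # Z: distance of the 3D box center
    }
    return nn.ModuleDict({
        name: nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, kernel_size=1),
        )
        for name, out_ch in channels.items()
    })
```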

Training. The training of all keypoint and main-center heatmaps follows [18, 45]. The loss addresses the imbalance between positive and negative samples with the focal loss [22]:

$$\begin{aligned} \begin{array}{lr} L_{kp}^{K} = -\frac{1}{N} \sum \limits _{k=1}^{K}\sum \limits _{x=1}^{H/S}\sum \limits _{y=1}^{W/S} \left\{ \begin{aligned} (1-\hat{p}_{kxy})^\alpha log(\hat{p}_{kxy}) \quad \quad \quad &{}if \quad p_{kxy}=1\\ (1-p_{kxy})^\beta \hat{p}_{kxy} log(1-\hat{p}_{kxy}) \quad &{}otherwise \end{aligned} \right. \end{array} \end{aligned}$$
(2)

where K is the number of keypoint channels, with \(K=C\) for the main center and \(K=9\) for the keypoints. N is the number of main centers or keypoints in an image, and \(\alpha \) and \(\beta \) are hyper-parameters that reduce the loss weight of negative and easy positive samples. We set \(\alpha =2\) and \(\beta =4\) in all experiments following [18, 45]. \(p_{kxy}\) is defined by a Gaussian kernel \(p_{xy}=exp\left( - \frac{x^2+y^2}{2 \sigma } \right) \) centered at the ground-truth keypoint \(\tilde{p}_{xy}\). For \(\sigma \), we find the maximal area \(A_{max}\) and minimal area \(A_{min}\) of the 2D boxes in the training data and set two hyper-parameters \(\sigma _{max}\) and \(\sigma _{min}\). We then define \(\sigma =A(\frac{\sigma _{max}-\sigma _{min}}{A_{max}-A_{min}})\) for an object with 2D box area A. For the regression of dimension and distance, we define the residual terms as:

$$\begin{aligned} \begin{array}{lr} \begin{aligned} L_D=\frac{1}{3N}\sum \limits _{x=1}^{H/S}\sum \limits _{y=1}^{W/S}\mathbbm {1}_{xy}^{obj}\left( D_{xy}- \Delta \widetilde{D}_{xy}\right) ^2\\ L_{Z}=\frac{1}{N}\sum \limits _{x=1}^{H/S}\sum \limits _{y=1}^{W/S}\mathbbm {1}_{xy}^{obj}\left( log(Z_{xy})-log(\widetilde{Z}_{xy})\right) ^2\\ \end{aligned} \end{array} \end{aligned}$$
(3)

We set \(\Delta \widetilde{D}_{xy}=log\frac{\widetilde{D}_{xy}-\bar{D}}{D_{\sigma }}\), where \(\bar{D}\) and \(D_{\sigma }\) are the mean and standard deviation of the dimensions in the training data. \(\mathbbm {1}_{xy}^{obj}\) denotes whether a main center appears at position (x, y). The offsets of the main center and vertexes are trained with an L1 loss following [45]:

$$\begin{aligned} \begin{array}{lr} L_{off}^{m}=\frac{1}{2N}\sum \limits _{x=1}^{H/S}\sum \limits _{y=1}^{W/S}\mathbbm {1}_{xy}^{obj}\left| M_{os}^{xy}-\left( \frac{p^m}{S}-\tilde{p}_{m} \right) \right| \\ L_{off}^{v}=\frac{1}{2N}\sum \limits _{x=1}^{H/S}\sum \limits _{y=1}^{W/S}\mathbbm {1}_{xy}^{ver}\left| V_{os}^{xy}-\left( \frac{p^v}{S}-\tilde{p}_{v} \right) \right| \end{array} \end{aligned}$$
(4)

where \(p^m, p^v\) are the positions of the main center and vertexes in the original image. The coordinates of the vertexes are regressed with an L1 loss:

$$\begin{aligned} \begin{array}{lr} \begin{aligned} L_{ver}=\frac{1}{N}\sum \limits _{k=1}^{8}\sum \limits _{x=1}^{H/S}\sum \limits _{y=1}^{W/S}\mathbbm {1}_{xy}^{ver}\left| V_{c}^{(2k-1):(2k)xy}- \left| \frac{{p^v}-{p}^m}{S} \right| \right| \end{aligned} \end{array} \end{aligned}$$
(5)

The final multi-task loss for keypoint detection is defined as:

$$\begin{aligned} \begin{aligned} L&= \omega _{main} L_{kp}^C+\omega _{kpver} L_{kp}^8+\omega _{ver}L_{ver}+\omega _{dim} L_D \\&\quad +\,\omega _{ori} L_{ori}+\omega _{Z} L_{dis}+\omega _{off}^{m} L_{off}^m+\omega _{off}^{v} L_{off}^v \end{aligned} \end{aligned}$$
(6)

We empirically set \(\omega _{main}=1, \omega _{kpver}=1, \omega _{ver}=1, \omega _{dim}=1, \omega _{ori}=0.5, \omega _{Z}=0.1, \omega _{off}^{m}=0.5\), and \(\omega _{off}^{v}=0.5\) in our experiments.
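For reference, a minimal PyTorch sketch of the heatmap loss of Eq. (2) is given below, written as the equation reads; the tensor shapes and the function name are assumptions.

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4):
    """Sketch of Eq. (2) for one heatmap branch. `pred` holds the predicted
    scores p_hat in (0, 1) and `gt` the Gaussian-smoothed ground truth p,
    both of shape (B, K, H/S, W/S); N is the number of positive locations."""
    eps = 1e-12
    pos = gt.eq(1).float()                                   # locations with p_kxy = 1
    neg = 1.0 - pos
    pos_term = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_term = ((1 - gt) ** beta) * pred * torch.log(1 - pred + eps) * neg
    n = pos.sum().clamp(min=1)                               # N, the number of positives
    return -(pos_term + neg_term).sum() / n
```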

3.2 3D Bounding Box Estimate

We estimate the 3D bounding box by enforcing 2D-3D consistency between the estimated 2D-related and 3D-related information given by our keypoint detection network. Considering an image I, a set of \(i=1...N\) objects is represented by nine keypoints and other optional priors: keypoints \(\widehat{kp}_{ij}\) for \(j\in 1...9\), dimension \(\widehat{D}_i\), orientation \(\hat{\theta }_i\), and distance \(\widehat{Z}_i\). The corresponding 3D bounding box \(B_i\) is defined by its rotation \(R_i(\theta ) \), position \(T_i = [T_i^x, T_i^y, T_i^z]^T \), and dimensions \(D_i = [h_i, w_i, l_i]^T\). Our goal is to estimate the 3D bounding box \(B_i\) whose projected 3D center and vertexes in image space best fit the corresponding 2D keypoints \(\widehat{kp}_{ij}\). This can be solved by minimizing the reprojection error between the projected 3D keypoints and the detected 2D keypoints. We formulate this error and the other prior errors as a nonlinear least squares optimization problem:

$$\begin{aligned} \begin{aligned} R^*, T^*, D^*&= \mathop {\arg \min }\limits _{\{R,T,D\}}\sum \limits _{R_i,T_i,D_i}\left\| e_{cp}\left( R_i,T_i,D_i,\widehat{kp}_i\right) \right\| ^2_{\Sigma _{i}} \\&+\,\omega _d\left\| e_d\left( D_i,\widehat{D}_i\right) \right\| ^2_2 +\omega _r\left\| e_r\left( R_i,\hat{\theta }_i\right) \right\| ^2_2 \end{aligned} \end{aligned}$$
(7)

where \(e_{cp}(\cdot ), e_d(\cdot ), e_r(\cdot )\) are the measurement errors of the camera-point, dimension prior, and orientation prior, respectively. We set \(\omega _d=1\) and \(\omega _r=1\) in our experiments. \(\Sigma \) is the covariance matrix of the keypoint projection error, given by the confidence extracted from the heatmap at the corresponding keypoints:

$$\begin{aligned} \Sigma _i=diag(softmax(V(\widehat{kp}_i))) \end{aligned}$$
(8)

In the rest of this section, we first define each error term and then introduce how to optimize the formulation.

Camera-Point. Following [9], the homogeneous coordinates of the eight vertexes and the 3D center are parameterized as:

$$\begin{aligned} P_{3D}^i&=diag(D_i)\,Cor \\ Cor&= \begin{bmatrix} 0 & 0 & 0 & 0 & -1 & -1 & -1 & -1 & -1/2 \\ 1/2 & -1/2 & -1/2 & 1/2 & 1/2 & -1/2 & -1/2 & 1/2 & 0 \\ 1/2 & 1/2 & -1/2 & -1/2 & 1/2 & 1/2 & -1/2 & -1/2 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \end{aligned}$$
(9)

Given the camera intrinsics matrix K, the projection of these 3D points into the image coordinate is:

$$\begin{aligned} {kp_i}=\frac{1}{s_i}K \begin{bmatrix} R & T\\ 0^T & 1 \end{bmatrix} diag(D_i)\,Cor=\frac{1}{s_i}K\exp (\xi ^{\wedge })diag(D_i)\,Cor \end{aligned}$$
(10)

where \(\xi \in \mathfrak {se}_3\) and \(\exp \) maps \(\mathfrak {se}_3 \) into \(SE_3\) space. The projected coordinates should fit tightly to the 2D keypoints produced by the detection network. Therefore, the camera-point error is defined as:

$$\begin{aligned} \begin{aligned} e_{cp}=\widehat{kp}_i-{kp_i} \end{aligned} \end{aligned}$$
(11)

Minimizing the camera-point error requires the Jacobian in \(\mathfrak {se}_3\) space, which is given by:

$$\begin{aligned} \frac{\partial e_{cp}}{\partial \delta \xi }=-\begin{bmatrix} \frac{f_x}{Z^{'}} & 0 & -\frac{f_xX^{'}}{Z^{'2}} & -\frac{f_xX^{'}Y^{'}}{Z^{'2}} & f_x+\frac{f_xX^{'2}}{Z^{'2}} & -\frac{f_xY^{'}}{Z^{'}} \\ 0 & \frac{f_y}{Z^{'}} & -\frac{f_yY^{'}}{Z^{'2}} & -f_y-\frac{f_yY^{'2}}{Z^{'2}} & \frac{f_yX^{'}Y^{'}}{Z^{'2}} & \frac{f_yX^{'}}{Z^{'}} \end{bmatrix} \end{aligned}$$
(12)

where \(P^{'}=[X^{'},Y^{'},Z^{'}]^{T}=\left( \exp (\xi ^{\wedge })P\right) _{1:3}\) and \(f_x, f_y\) are the focal lengths in K.
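To make Eqs. (9)-(11) concrete, the NumPy sketch below builds the nine object-frame points from the Cor matrix, projects them with an explicit rotation R and translation T in place of \(\exp (\xi ^{\wedge })\), and returns the camera-point residual. Argument names are illustrative, and the homogeneous row of Cor is folded into a 3×3 formulation.

```python
import numpy as np

def camera_point_residual(kp_hat, K, R, T, dims):
    """Sketch of Eqs. (9)-(11): build the nine 3D points from the object
    dimensions via Cor, project them with the intrinsics K (3x3) and the
    pose (R, T), and return the 2D residuals against the detected keypoints
    kp_hat of shape (9, 2)."""
    h, w, l = dims
    # Columns of Cor: eight vertexes followed by the 3D box center, in the
    # object frame, to be scaled by diag(D) = diag(h, w, l).
    cor = np.array([
        [0.0, 0.0, 0.0, 0.0, -1.0, -1.0, -1.0, -1.0, -0.5],
        [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5, 0.0],
        [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.0],
    ])
    p3d = np.diag([h, w, l]) @ cor                 # (3, 9) points in the object frame
    p_cam = R @ p3d + np.reshape(T, (3, 1))        # transform into the camera frame
    uvw = K @ p_cam                                # (3, 9) homogeneous image points
    kp = (uvw[:2] / uvw[2:]).T                     # (9, 2) projected keypoints kp_i
    return (kp_hat - kp).ravel()                   # e_cp of Eq. (11), flattened
```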

Dimension-Prior: The \(e_d\) is simply defined as:

$$\begin{aligned} \begin{aligned} e_{d}=\widehat{D}_i-D_i \end{aligned} \end{aligned}$$
(13)

Rotation-Prior: We define \(e_r\) in SE3 space and use log to map the error into its tangent vector space:

$$\begin{aligned} \begin{aligned} e_{r}=\log (R^{-1}R(\hat{\theta }))^{\vee }_{\mathfrak {se}_3} \end{aligned} \end{aligned}$$
(14)

These multivariate equations can be solved with the Gauss-Newton or Levenberg-Marquardt algorithm in the g2o library [17]. A good initialization is mandatory for this optimization strategy. We adopt the prior information generated by the keypoint detection network as the initial value, which is very important for improving the detection speed.
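As an illustration of how Eq. (7) can be minimized, the sketch below replaces the g2o solver with SciPy's Levenberg-Marquardt routine and numeric Jacobians, reusing the camera_point_residual sketch above. The parameterization over yaw, translation, and dimensions, the weighting by the keypoint confidences of Eq. (8), and the initial translation derived from the center keypoint and depth prior are all assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def estimate_box(kp_hat, K, theta0, dims0, z0, Sigma, w_d=1.0, w_r=1.0):
    """Minimize Eq. (7) over [theta, tx, ty, tz, h, w, l] with
    Levenberg-Marquardt, initialized from the network priors
    (theta0, dims0, z0). Sigma is the diagonal confidence matrix of
    Eq. (8); kp_hat is the (9, 2) array of detected keypoints."""
    # Rough initial translation: back-project the detected 3D-center
    # keypoint (last Cor column) with the depth prior (an assumption).
    cx, cy = kp_hat[8]
    t0 = np.array([(cx - K[0, 2]) * z0 / K[0, 0],
                   (cy - K[1, 2]) * z0 / K[1, 1],
                   z0])
    # Weight each keypoint residual by the square root of its heatmap
    # confidence, one reading of the Sigma-weighted norm in Eq. (7).
    kp_w = np.sqrt(np.maximum(np.diag(Sigma), 0.0))

    def residuals(params):
        theta, t, dims = params[0], params[1:4], params[4:7]
        R = Rotation.from_euler("y", theta).as_matrix()      # yaw-only rotation
        e_cp = camera_point_residual(kp_hat, K, R, t, dims)  # Eq. (11)
        e_d = w_d * (np.asarray(dims0) - dims)               # Eq. (13)
        e_r = np.array([w_r * (theta0 - theta)])             # yaw stand-in for Eq. (14)
        return np.concatenate([np.repeat(kp_w, 2) * e_cp, e_d, e_r])

    x0 = np.concatenate([[theta0], t0, dims0])
    return least_squares(residuals, x0, method="lm").x       # optimized parameters
```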

4 Experimental

4.1 Implementation Details

Our experiments are evaluated on the KITTI 3D detection benchmark [9], which has a total of 7481 training images and 7518 test images. We follow [7] and [39] to split the training set into \(train_1, val_1\) and \(train_2, val_2\), respectively, and comprehensively compare our framework with other methods on these two validation sets as well as the test set.

Our deep neural network is implemented in PyTorch on a machine with an i7-8086K CPU and two 1080Ti GPUs. All original images are padded to \(1280\times 384\) for training and testing. We project the ground-truth 3D bounding boxes into the left and right images to obtain ground-truth keypoints and use horizontal flipping as data augmentation, which quadruples the original training set. We run the Adam [14] optimizer with a base learning rate of 0.0002 for 300 epochs and reduce the learning rate by \(10\times \) at epochs 150 and 180. For the standard deviation of the Gaussian kernel, we set \(\sigma _{max}=19\) and \(\sigma _{min}=3\). Based on the statistics of the KITTI dataset, we set \(\tilde{l}=3.89, \tilde{w}=1.62, \tilde{h}=1.53\) and \(\sigma _{\tilde{l}}=0.41,\sigma _{\tilde{w}}=0.1,\sigma _{\tilde{h}}=0.13\). In the inference step, after \(3\times 3\) max pooling, we filter the main centers and keypoints with thresholds of 0.4 and 0.1, respectively, and only keypoints that lie within the image bounds are sent to the geometric constraint module. The backbone networks are initialized with a classification model pre-trained on the ImageNet data set. Finally, training takes about three days for ResNet-18 with batch size 30 and four days for DLA-34 with batch size 16.
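The inference-time peak filtering described above can be sketched as follows; the function name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, threshold):
    """Sketch of the inference step described above: a 3x3 max pooling acts
    as local non-maximum suppression on the heatmap, and peaks below
    `threshold` (0.4 for the main center, 0.1 for keypoints) are discarded.
    `heatmap` has shape (B, K, H/S, W/S)."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    keep = (pooled == heatmap) & (heatmap >= threshold)   # local maxima above threshold
    batch, channel, ys, xs = torch.nonzero(keep, as_tuple=True)
    scores = heatmap[batch, channel, ys, xs]
    return torch.stack([batch, channel, xs, ys], dim=1), scores
```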

Table 2. Comparison of our framework with current image-based 3D detection methods for the car category, evaluated using the \(AP_{3D}\) metric on \(val_1\)/\(val_2\) of the KITTI data set. "Extra" denotes the extra data used in training. The highest, second-highest, and third-highest results are highlighted.
Table 3. Comparison of our framework with current image-based 3D detection frameworks for car category, evaluated using metric \(AP_{BEV}\) on \(val_1\)/\(val_2\) of KITTI data set.

4.2 Comparison with Other Methods

To fully evaluate the performance of our keypoint-based method, we report the three official KITTI evaluation metrics: average precision for 3D intersection-over-union (\(AP_{3D}\)), average precision in the bird's-eye view (\(AP_{BEV}\)), and average orientation similarity (AOS) when the 2D bounding box is available. We evaluate our method on three difficulty settings: easy, moderate, and hard, according to the object's occlusion, truncation, and height in image space [9].

\(\varvec{AP_{3D}}\) and \(\varvec{AP_{BEV}}\). We compare our method with current image-based state-of-the-art approaches and also compare running times, although it is not possible to list the running times of all previous methods because most of them do not report their efficiency. The \(AP_{3D}\), \(AP_{BEV}\), and running-time results are shown in Tables 2 and 3, respectively. With ResNet-18 as the backbone, our method achieves the best speed while outperforming most image-only methods in accuracy. In particular, it is more than 100 times faster than Mono3D [6] while outperforming it by over 10% in both \(AP_{BEV}\) and \(AP_{3D}\) across all datasets. In addition, our ResNet-18 model is more than 75 times faster than 3DOP [7], which uses stereo images as input, with comparable accuracy. With DLA-34 as the backbone, our method achieves the best accuracy while keeping a relatively good speed: it is about 3 times faster than the recently proposed M3D-RPN [3] while improving most of the metrics. Note that comparing our method with all of these approaches is not entirely fair, because most of them rely on extra stand-alone networks or data in addition to monocular images. Nevertheless, we achieve the best speed with better performance.

Fig. 4.

Qualitative results of our 3D detection. From top to bottom: keypoints, projections of the 3D bounding box, and the bird's-eye-view image; ground truths are shown in green and our results in blue. The crimson, green, and red arrows point to occluded, distant, and truncated objects, respectively. (Color figure online)

Results on the KITTI Testing Set. We also evaluate our results on the KITTI testing set, as shown in Table 4.

4.3 Qualitative Results

Figure 4 shows some qualitative results of our method. We visualize the keypoint detection network outputs, the geometric constraint module outputs, and the BEV images. The projected 3D boxes on the images demonstrate that our method can handle crowded and truncated objects. The BEV results show that our method localizes objects accurately in different scenes.

4.4 Ablation Study

Effect of Optional Components. Optional components can be employed to enhance our method: dimension, orientation, distance, and keypoint offsets. We experiment with different combinations to demonstrate their effect on 3D detection. The results are shown in Table 5; we train our network with the DLA-34 backbone and evaluate it using \(AP_{3D}\) and \(AP_{BEV}\). The combination of dimension, orientation, distance, and keypoint offsets achieves the best accuracy while also running faster. This is because we take the outputs predicted by our network as the initial values of the geometric optimization module, which reduces the search space of the gradient descent method.

Table 4. Comparing 3D detection \(AP_{3D}\) on KITTI testing set. We use the DLA-34 as the backbone.
Table 5. Ablation study of different optional selecting results on \(val_1\) set. We use the DLA-34 as the backbone.

Effect of Keypoint FPN. We propose the keypoint FPN as a strategy to improve the performance of multi-scale keypoint detection. To better understand its effect, we compare \(AP_{3D}\) and \(AP_{BEV}\) with and without KFPN. As shown in Table 6, using KFPN yields improvements across all sets with no significant change in time consumption.

Table 6. Comparing 3D detection \(AP_{3D}\) of w/o KFPN and w/ KFPN for car category on \(val_1\) set. We use the DLA-34 as the backbone.
Table 7. Comparing of 2D detection \(AP_{2D}\) with IoU = 0.7 and orientation AOS results for car category evaluated on \(val_1\)/\(val_2\) of KITTI data set. Only the results under the moderate criteria are shown. Ours (2D) represents the results from the keypoint detection network, and Ours (3D) is the 2D bounding box of the projected 3D box.

2D Detection and Orientation. Although our focus is on 3D detection, we also report the performance of our method on 2D detection and orientation evaluation to better understand its comprehensive capabilities. We report AOS and AP with a threshold of IoU = 0.7 for comparison. The results are shown in Table 7. Deep3DBox trains MS-CNN [4] on KITTI to produce 2D bounding boxes and adopts VGG16 [35] for orientation prediction, which gives it the highest accuracy. Deep3DBox thus takes advantage of a better 2D detector; however, our \(AP_{3D}\) outperforms it by about 20% on the moderate set, which emphasizes the importance of customizing the network specifically for 3D detection. Another interesting finding is that the 2D accuracy of the back-projected 3D results is better than the direct prediction, because our method can infer the occluded parts of objects.

5 Conclusion

In this paper, we have proposed a faster and more accurate monocular 3D object detection method for autonomous driving scenarios. We reformulate 3D detection as a keypoint detection problem and show how to recover the 3D bounding box using the keypoints and geometric constraints. We specifically customize the keypoint detection network for 3D detection, which simultaneously predicts the keypoints of the 3D box and other prior information of the object using only images. Our geometric module formulates these priors as easy-to-optimize loss terms. Our approach generates stable and accurate 3D bounding boxes without stand-alone networks or additional annotations, while achieving real-time running speed.