
1 Introduction

Monocular 3D object detection is an important topic in the self-driving and computer vision communities. It is popular due to its low cost and simple sensor configuration, and rapid progress [5, 6, 25, 28, 36, 60] has been made in recent years. A well-known challenge in this task lies in instance depth estimation, which is the bottleneck to boosting performance since depth information is lost during the camera projection process.

Many previous works [2, 12, 39] directly regress the instance depth. This manner ignores the ambiguity inherent in the instance depth itself. As shown in Fig. 1, for the right object, the instance depth is the sum of the car-tail depth and half of the car length, where the car length is ambiguous since both the left and right sides of the car are invisible. For the left object, beyond the intuitive visible surface depth, the instance depth further depends on the car dimension and orientation. We can observe that the instance depth is non-intuitive: it requires the network to additionally learn instance inherent attributes in the instance depth head. Previous direct-estimation and indirect-optimization methods do not fully consider this coupled nature and thus lead to suboptimal, less accurate instance depth estimation.

Fig. 1. We decouple the instance depth into visual depths and attribute depths due to the coupled nature of instance depth. Please refer to the text for more details. Best viewed in color with zoom in.

Based on the analysis above, in this paper we propose to decouple the instance depth into the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). We illustrate some examples in Fig. 1. For each point (or small patch) on the object, the visual depth denotes the absolute depth from that point to the agent (car/robot) camera, and the attribute depth denotes the relative depth offset from the point (or small patch) to the object’s 3D center. This decoupling encourages the network to learn different feature patterns for the two parts of the instance depth. Visual depth in monocular imagery depends on the object’s appearance and position [11] on the image, and is thus affine-sensitive. By contrast, attribute depth relies heavily on the object’s inherent attributes (e.g., dimensions and orientation); it focuses on features inside the RoI and is affine-invariant (see Sects. 4.1 and 4.2 for a detailed discussion). Thus the attribute depth is independent of the visual depth, and the decoupled instance depth allows us to use separate heads to extract different types of features for the different types of depths.

Specifically, for an object image patch, we divide it into \(m\times n\) grids. Each grid denotes a small area on the object and is assigned a visual depth and the corresponding attribute depth. Considering occlusion and 3D location uncertainty, we use an uncertainty to denote the confidence of each depth prediction. At inference, every object produces \(m\times n\) instance depth predictions; we take advantage of them and the associated uncertainties to adaptively obtain the final instance depth and its confidence.

Furthermore, prior works are usually limited in the diversity of data augmentation, due to the complexity of keeping 2D and 3D objects aligned when applying an affine transformation to the 2D image. Based on the decoupled instance depth, we show that our method can effectively perform data augmentation, including affine transformations. This is achieved by the affine-sensitive and affine-invariant nature of the visual depth and attribute depth, respectively (see Sect. 4.3). To demonstrate the effectiveness of our method, we perform experiments on the widely used KITTI dataset. The results show that our method outperforms prior works by a significant margin.

In summary, our contributions are listed as follows:

  1. We point out the coupled nature of instance depth. Due to the entangled features, the previous way of directly predicting instance depth is suboptimal. Therefore, we propose to decouple the instance depth into attribute depths and visual depths, which are independently predicted.

  2. We present two types of uncertainties to represent depth estimation confidence. Based on this, we propose to adaptively aggregate the different types of depths into the final instance depth and correspondingly obtain the 3D localization confidence.

  3. With the help of the proposed attribute depth and visual depth, we alleviate the limitation of using affine transformation in data augmentation for monocular 3D detection.

  4. Evaluated on the KITTI benchmark, our method sets a new state of the art (SOTA). Extensive ablation studies demonstrate the effectiveness of each component in our method.

2 Related Work

2.1 LiDAR-Based 3D Object Detection

LiDAR-based methods utilize precise LiDAR point clouds to achieve high performance. According to the representation used, they can be categorized into point-based, voxel-based, hybrid, and range-view-based methods. Point-based methods [37, 42, 51] directly consume raw point clouds to preserve fine-grained object structures, but they usually suffer from high computational costs. Voxel-based methods [30, 50, 53, 56, 58] voxelize the unordered point clouds into regular grids so that CNNs can be easily applied. These methods are more efficient, but voxelization introduces quantization errors, resulting in information loss. To exploit the advantages of different representations, hybrid methods [8, 32, 41, 43, 52] have been proposed; they show that combining point-based and voxel-based designs achieves a better trade-off between accuracy and efficiency. Range-view-based methods [1, 4, 13, 19] organize point clouds in the range view, a compact representation of point clouds. These methods are also computationally efficient but remain under-explored.

2.2 Monocular 3D Object Detection

Due to its low cost and setup simplicity, monocular 3D object detection has become popular in recent years. Previous monocular works can be roughly divided into image-only and depth-map-based methods. The pioneering method [6] integrates different types of information, such as segmentation and scene priors, to perform 3D detection. To learn spatial features, OFTNet [40] projects 2D image features into 3D space. M3D-RPN [2] extracts depth-aware features by using different convolution kernels on different image rows. Kinematic3D [3] uses multiple frames to capture temporal information by employing a tracking module. GrooMeD-NMS [18] develops a trainable NMS step to boost the final performance. During the same period, many works take full advantage of scene and geometry priors [23, 44, 48, 59]. Due to the ill-posed nature of monocular imagery, some works resort to dense depth estimation [12, 26, 35, 39, 46, 47, 49]. With the help of estimated depth maps, RoI-10D [29] uses CAD models to augment training samples. Pseudo-LiDAR [27, 49] based methods are also popular: they convert estimated depth maps into point clouds, on which well-designed LiDAR 3D detectors can be directly employed. CaDDN [39] predicts a categorical depth distribution to precisely project depth-aware features into 3D space. In sum, benefiting from rapid developments in deep learning, monocular 3D detection has made remarkable progress.

2.3 Estimation of Instance Depth

Most monocular works directly predict the instance depth. There are also works that use auxiliary information to aid instance depth estimation in post-processing, usually by taking advantage of geometry constraints and scene priors. The early work Deep3DBox [31] regresses object dimensions and orientations; the remaining 3D box parameters, including the instance depth, are estimated from 2D-3D box geometry constraints. This indirect approach performs poorly because it does not fully use the available supervision. To use geometric projection constraints, RTM3D [21] predicts nine perspective key-points (center and corners) of the 3D bounding box in image space; the initially estimated instance depth can then be optimized by minimizing projection errors. KM3D [20] follows this line and integrates the optimization into an end-to-end training process. More recently, MonoFlex [55] also predicts the nine perspective key-points. In addition to a directly predicted instance depth, it uses the projected heights of paired key-points and geometric relationships to produce new instance depth estimates, which are combined by an ensemble strategy to obtain the final instance depth. Differing from previous works, GUPNet [25] proposes an uncertainty propagation method for the instance depth. It uses the estimated object 3D dimensions and 2D height to obtain an initial instance depth, and additionally predicts a depth bias to refine it. GUPNet mainly focuses on tackling the error amplification problem in the geometric projection process. MonoRCNN [44] also introduces a distance decomposition based on the 2D and 3D heights. Such methods use geometric or auxiliary information to refine the estimated instance depth, but they do not exploit the coupled nature of the instance depth.

3 Overview and Framework

Preliminaries. Monocular 3D detection takes an image captured by an RGB camera as input and predicts amodal 3D bounding boxes of objects in 3D space. Each 3D box is parameterized by its 3D center location \((x, y, z)\), dimensions \((h, w, l)\), and orientation \((\theta )\). Note that in the self-driving scenario the orientation usually refers to the yaw angle; the roll and pitch angles are zero by default. Also, the camera of the ego-car/robot is assumed to be calibrated.
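For concreteness, the parameterization above can be summarized as a small container. The following is a minimal Python sketch; the field names and layout are illustrative rather than taken from any released code.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Amodal 3D box in camera coordinates (illustrative container only)."""
    x: float    # 3D center, lateral axis
    y: float    # 3D center, vertical axis
    z: float    # 3D center, depth axis (the instance depth)
    h: float    # object height
    w: float    # object width
    l: float    # object length
    yaw: float  # orientation around the vertical axis; roll and pitch assumed zero
```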

Fig. 2. Network framework. The overall design follows GUPNet [25]. The estimated 2D boxes are used to extract specific features for each object, followed by 3D box heads, which predict the required 3D box parameters. The red parts in the 3D heads denote the newly proposed components. They are used to decouple the instance depth.

We now provide an overview of the framework, shown in Fig. 2. First, the network takes an RGB image \(\textbf{I}\in \mathbb {R}^{H\times W\times 3}\) as input. After feature encoding, we obtain deep features \(\textbf{F}\in \mathbb {R}^{\frac{H}{4}\times \frac{W}{4}\times C}\), where C is the number of channels. Second, the deep features \(\textbf{F}\) are fed into three 2D detection heads, producing the 2D heatmap \(\textbf{H}\in \mathbb {R}^{\frac{H}{4}\times \frac{W}{4}\times B}\), 2D offset \(\textbf{O}_{2d}\in \mathbb {R}^{\frac{H}{4}\times \frac{W}{4} \times 2}\), and 2D size \(\textbf{S}_{2d}\in \mathbb {R}^{\frac{H}{4}\times \frac{W}{4} \times 2}\), where B is the number of categories. Combining these 2D head predictions gives the 2D box predictions. Then, with the 2D box estimates, per-object features are obtained from the deep features \(\textbf{F}\) by RoI Align, yielding object features \(\textbf{F}_{obj}\in \mathbb {R}^{n\times 7\times 7\times C}\), where \(7 \times 7\) is the RoI Align size and n is the number of RoIs. Finally, these object features \(\textbf{F}_{obj}\) are fed into the 3D detection heads to produce the 3D parameters: 3D box dimension \(\textbf{S}_{3d}\in \mathbb {R}^{n \times 3}\), 3D center projection offset \(\textbf{O}_{3d}\in \mathbb {R}^{n \times 2}\), orientation \(\varTheta \in \mathbb {R}^{n \times k \times 2}\) (we follow the multi-bin design [31], where k is the number of bins), visual depth \(\textbf{D}_{vis}\in \mathbb {R}^{n \times 7\times 7}\), visual depth uncertainty \(\textbf{U}_{vis}\in \mathbb {R}^{n \times 7\times 7}\), attribute depth \(\textbf{D}_{att}\in \mathbb {R}^{n \times 7\times 7}\), and attribute depth uncertainty \(\textbf{U}_{att}\in \mathbb {R}^{n \times 7\times 7}\). From these parameters we obtain the final 3D box predictions. In the following sections, we detail the proposed method.
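To make the head outputs concrete, the sketch below shows one possible PyTorch implementation of the 3D heads operating on the RoI-aligned features. The layer widths, the head structure (a 1×1 convolution per spatial map, pooled linear layers for the global quantities), and the channels-first layout are our assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class Decoupled3DHeads(nn.Module):
    """Illustrative 3D heads on RoI-aligned features of shape (n, C, 7, 7)."""

    def __init__(self, c: int, num_bins: int = 12):
        super().__init__()
        def fc_head(out_dim):   # heads that pool the RoI into a vector
            return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(c, out_dim))
        def map_head():         # heads that keep the 7x7 spatial layout
            return nn.Conv2d(c, 1, kernel_size=1)

        self.dim_head = fc_head(3)                 # S_3d: (n, 3)
        self.offset3d_head = fc_head(2)            # O_3d: (n, 2)
        self.orient_head = fc_head(num_bins * 2)   # Theta: (n, k, 2)
        self.vis_depth = map_head()                # D_vis: (n, 7, 7)
        self.vis_unc = map_head()                  # U_vis: (n, 7, 7)
        self.att_depth = map_head()                # D_att: (n, 7, 7)
        self.att_unc = map_head()                  # U_att: (n, 7, 7)

    def forward(self, f_obj: torch.Tensor):
        return {
            "dim": self.dim_head(f_obj),
            "offset_3d": self.offset3d_head(f_obj),
            "orientation": self.orient_head(f_obj).view(f_obj.size(0), -1, 2),
            "vis_depth": self.vis_depth(f_obj).squeeze(1),
            "vis_unc": self.vis_unc(f_obj).squeeze(1),
            "att_depth": self.att_depth(f_obj).squeeze(1),
            "att_unc": self.att_unc(f_obj).squeeze(1),
        }
```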

4 Decoupled Instance Depth

We divide the RoI image patch into \(7\times 7\) grids, assigning each grid a visual depth value and an attribute depth value. We provide an ablation on the grid size for visual and attribute depth in the experiments (see Sect. 5.4 and Table 4). In the following, we first detail the two types of depths, then describe the decoupled-depth based data augmentation, then the way of obtaining the final instance depth, and finally the loss functions.

4.1 Visual Depth

The visual depth denotes the physical depth of the object surface within a small grid of the RoI image patch. For each grid, we define the visual depth as the average pixel-wise depth within the grid. If the grid covers a single pixel, the visual depth equals the pixel-wise depth. Given that a pixel denotes the quantized surface of the object, visual depths can be regarded as a general extension of pixel-wise depths.
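Assuming a dense (e.g., completed LiDAR) depth map is available for each RoI patch, visual depth labels can be built by averaging the valid pixel depths inside each grid cell. The sketch below is one way to do this; the invalid-pixel masking and the use of adaptive average pooling are our assumptions.

```python
import torch
import torch.nn.functional as F

def visual_depth_labels(depth_patch: torch.Tensor, grid: int = 7) -> torch.Tensor:
    """Average the pixel-wise depth of one RoI patch into grid x grid cells.

    depth_patch: (H_roi, W_roi) dense depth values for the object patch;
                 invalid pixels are assumed to be <= 0.
    Returns: (grid, grid) visual depth labels.
    """
    d = depth_patch.unsqueeze(0).unsqueeze(0)           # (1, 1, H, W)
    valid = (d > 0).float()
    # Per-cell mean of (depth * mask) divided by per-cell mean of mask
    # equals the mean depth over valid pixels in that cell.
    depth_mean = F.adaptive_avg_pool2d(d * valid, grid)
    valid_frac = F.adaptive_avg_pool2d(valid, grid)
    return (depth_mean / valid_frac.clamp(min=1e-6)).squeeze()
```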

The visual depth in monocular imagery has an important property. For a monocular system, visual depth relies heavily on the object’s 2D box size (a faraway object appears small on the image, and vice versa) and its position on the image (smaller v coordinates in the image coordinate system indicate larger depths) [11]. Therefore, if we apply an affine transformation to the image, the visual depth should be transformed correspondingly, i.e., the depth value should be scaled. We call this property affine-sensitivity.

4.2 Attribute Depth

The attribute depth refers to the depth offset from the visible surface to the object’s 3D center. We call it attribute depth because it is mainly related to the object’s inherent attributes. For example, when the car’s orientation is parallel to the z-axis (depth direction) in 3D space, the attribute depth of the car tail is half the car’s length. By contrast, the attribute depth is half the car’s width if the orientation is parallel to the x-axis. We can see that the attribute depth depends on the object’s semantics and inherent attributes. In contrast to the affine-sensitive nature of visual depth, attribute depth is invariant to any affine transformation of the image because the object’s inherent characteristics do not change. We call this property affine-invariance.
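Given the definition above (and the label construction in Sect. 4.5), an attribute depth label is simply the object's center depth minus the visual depth of the cell. A minimal sketch, reusing the visual depth labels from the previous snippet:

```python
import torch

def attribute_depth_labels(center_z: float, vis_depth: torch.Tensor) -> torch.Tensor:
    """Per-grid attribute depth label: offset from the visible surface to the
    object's 3D center along the depth axis, i.e., d_att = d_ins - d_vis.

    center_z:  instance depth (z of the object's 3D center).
    vis_depth: (grid, grid) visual depth labels.
    """
    return center_z - vis_depth
```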

As described above, we use two separate heads to estimate the visual depth and the attribute depth, respectively. This disentanglement of the instance depth has several advantages: (1) the object depth is decoupled in a reasonable and intuitive manner, so we can represent the object more comprehensively and precisely; (2) the network can extract different types of features for the different types of depths, which facilitates learning; (3) benefiting from the decoupled depth, our method can effectively perform affine-transformation-based data augmentation, which is usually limited in previous works.

Fig. 3. Affine transformation based data augmentation. We do not change the object’s inherent attributes, i.e., attribute depths, dimensions, and observation angles. The visual depth is scaled according to the 2D height scale factor. The 3D center projection is transformed together with the image affine transformation.

4.3 Data Augmentation

In monocular 3D detection, many previous works are limited in data augmentation: most of them only employ photometric distortion and flipping. Affine-transformation-based augmentation is hard to adopt because the instance depth after the transformation is unknown. Based on the decoupled depth, we show that our method can alleviate this problem.

We illustrate an example in Fig. 3. Specifically, we add a random cropping and scaling strategy [57] to the data augmentation, as sketched in the code below. The 3D center projection point on the image undergoes the same affine transformation as the image. The visual depth is rescaled according to the scale factor along the y-axis of the image, since \(d=\frac{f\cdot h_{3d}}{h_{2d}}\), where \(f, h_{3d}, h_{2d}\) are the focal length, object 3D height, and 2D height, respectively. Conversely, the attribute depth remains the same due to its affine-invariant nature. We do not directly scale the instance depth, as doing so would implicitly corrupt the attribute depth. Similarly, the other inherent attributes of the object, i.e., the observation angle and the dimensions, keep their original values. We empirically show that this data augmentation works well and provide ablations in the experiments (see Sect. 5.4 and Table 3).
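A minimal sketch of how the decoupled depth targets could be adjusted under such an augmentation; the function interface is illustrative, and the surrounding pipeline (box and 3D-center transforms) is omitted.

```python
import torch

def augment_depth_targets(vis_depth: torch.Tensor,
                          att_depth: torch.Tensor,
                          scale_y: float):
    """Adjust decoupled depth targets under a random crop/scale augmentation.

    scale_y: factor by which the object's 2D height is enlarged on the image.
    Since d = f * h_3d / h_2d, enlarging h_2d by scale_y shrinks the visual
    depth by the same factor, while the attribute depth is affine-invariant
    and stays untouched.
    """
    vis_depth = vis_depth / scale_y   # affine-sensitive: rescale
    att_depth = att_depth             # affine-invariant: unchanged
    return vis_depth, att_depth
```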

Fig. 4. Depth flow for an object. We use the visual depth, attribute depth, and the associated uncertainties to obtain the final instance depth.

4.4 Depth Uncertainty and Aggregation

The 2D classification score cannot fully express the confidence in monocular 3D detection because of the difficulty of 3D localization. Previous works [25, 45] use the instance depth confidence or a 3D IoU loss, combined with the 2D detection confidence, to represent the final 3D detection confidence. Given that we have decoupled the instance depth into visual depth and attribute depth, we can further decouple the instance depth uncertainty: only when an object has both low visual depth uncertainty and low attribute depth uncertainty can its instance depth have high confidence.

Inspired by [16, 25], we assume every depth prediction follows a Laplace distribution. Specifically, each visual depth \(d_{vis}\) in \(\textbf{D}_{vis}\in \mathbb {R}^{n \times 7\times 7}\) and its corresponding uncertainty \(u_{vis}\) in \(\textbf{U}_{vis}\in \mathbb {R}^{n \times 7\times 7}\) define the Laplace distribution \(L(d_{vis},u_{vis})\). Similarly, the attribute depth distribution is \(L(d_{att},u_{att})\), where \(d_{att}\) is in \(\textbf{D}_{att}\in \mathbb {R}^{n \times 7\times 7}\) and \(u_{att}\) in \(\textbf{U}_{att}\in \mathbb {R}^{n \times 7\times 7}\). Therefore, the instance depth distribution derived from the associated visual and attribute depths is \(L(\tilde{d}_{ins},\tilde{u}_{ins})\), where \(\tilde{d}_{ins}=d_{vis}+d_{att}\) and \(\tilde{u}_{ins}=\sqrt{u_{vis}^2+u_{att}^2}\). We use \(\mathbf {\tilde{D}}_{ins(patch)}\in \mathbb {R}^{n \times 7\times 7}\) and \(\mathbf {\tilde{U}}_{ins(patch)}\in \mathbb {R}^{n \times 7\times 7}\) to denote the instance depths and uncertainties on the RoI patch. We illustrate this depth flow in Fig. 4.

To obtain the final instance depth, we first convert the uncertainty to a probability [16, 25]: \(\textbf{P}_{ins(patch)}=\exp (-\mathbf {\tilde{U}}_{ins(patch)})\), where \(\textbf{P}_{ins(patch)} \in \mathbb {R}^{n \times 7\times 7}\). Then we aggregate the instance depths on the patch into the final instance depth. For the \(i^{th}\) object (\(i=1,\dots ,n\)), we have:

$$\begin{aligned} d_{ins} = \sum \frac{\mathbf {\tilde{D}}_{ins(patch)_i} \textbf{P}_{ins(patch)_i}}{\sum \textbf{P}_{ins(patch)_i}} \end{aligned}$$
(1)

The corresponding final instance depth confidence is:

$$\begin{aligned} p_{ins} =\sum (\frac{\textbf{P}_{ins(patch)_i} }{\sum \textbf{P}_{ins(patch)_i}} \textbf{P}_{ins(patch)_i} ) \end{aligned}$$
(2)

Therefore, the final 3D detection confidence is \(p = p_{2d} \cdot p_{ins}\), where \(p_{2d}\) is the 2D detection confidence.
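Putting the pieces together, the sketch below implements the per-grid depth combination, the uncertainty-to-probability conversion, and the aggregation of Eqs. (1) and (2). The tensor names follow the notation above, while the function itself is our illustration.

```python
import torch

def aggregate_instance_depth(d_vis, u_vis, d_att, u_att, p_2d):
    """Combine per-grid visual/attribute depths into one instance depth
    and a final 3D detection confidence, following Eqs. (1)-(2).

    d_vis, u_vis, d_att, u_att: (n, 7, 7) tensors; p_2d: (n,) 2D scores.
    """
    d_ins = d_vis + d_att                                # per-grid instance depth
    u_ins = torch.sqrt(u_vis ** 2 + u_att ** 2)          # combined uncertainty
    p_ins = torch.exp(-u_ins)                            # uncertainty -> probability
    w = p_ins / p_ins.sum(dim=(1, 2), keepdim=True)      # normalized weights
    depth = (w * d_ins).sum(dim=(1, 2))                  # Eq. (1)
    conf = (w * p_ins).sum(dim=(1, 2))                   # Eq. (2)
    return depth, p_2d * conf                            # p = p_2d * p_ins
```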

4.5 Loss Functions

2D Detection Part: As shown in Fig. 2, for the 2D object detection part, we follow the design of CenterNet [57]. The 2D heatmap \(\textbf{H}\) indicates rough object centers on the image. Its size is \(\frac{H}{4}\times \frac{W}{4}\times B\), where \(H \times W\) is the input image size and B is the number of categories. The 2D offset \(\textbf{O}_{2d}\) is the residual towards the rough 2D centers, and the 2D size \(\textbf{S}_{2d}\) denotes the 2D box height and width. Following CenterNet [57], we have the loss functions \(\mathcal {L}_{H},\mathcal {L}_{O_{2d}},\mathcal {L}_{S_{2d}}\), respectively.

3D Detection Part: For the 3D object dimensions, we follow the typical transformation and loss design [2], giving \(\mathcal {L}_{S_{3d}}\). For the orientation, the network predicts the observation angle and uses the multi-bin loss [31] \(\mathcal {L}_{\varTheta }\). We use the 3D center projection on the image plane and the instance depth to recover the object’s 3D location. For the 3D center projection, we predict the 3D projection offset relative to the 2D center, with the loss \(\mathcal {L}_{O_{3d}}=\mathrm{Smooth}L_1(\textbf{O}_{3d}, \textbf{O}_{3d}^*)\), where \(*\) denotes the corresponding labels. As mentioned above, we decouple the instance depth into visual depth and attribute depth. The visual depth labels are obtained by projecting LiDAR points onto the image, and the attribute depth labels are obtained by subtracting the visual depth labels from the instance depth labels. Combining with the uncertainty [9, 16], the visual depth loss is \(\mathcal {L}_{D_{vis}}=\frac{\sqrt{2}}{u_{vis}}\Vert d_{vis}-d_{vis}^*\Vert +\log (u_{vis})\), where \(u_{vis}\) is the uncertainty. Similarly, we have the attribute depth loss \(\mathcal {L}_{D_{att}}\) and the instance depth loss \(\mathcal {L}_{D_{ins}}\). Among these terms, the losses concerning the instance depth (\(\mathcal {L}_{D_{vis}}\), \(\mathcal {L}_{D_{att}}\), and \(\mathcal {L}_{D_{ins}}\)) play the most important role since they determine the objects’ localization in 3D space. We empirically set the weights of all loss terms to 1.0, and the overall loss is:

$$\begin{aligned} \mathcal {L}=\mathcal {L}_{H}+\mathcal {L}_{O_{2d}}+\mathcal {L}_{S_{2d}}+\mathcal {L}_{S_{3d}}+\mathcal {L}_{\varTheta }+\mathcal {L}_{O_{3d}}+\mathcal {L}_{D_{vis}}+\mathcal {L}_{D_{att}}+\mathcal {L}_{D_{ins}} \end{aligned}$$
(3)
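The uncertainty-aware depth loss above can be sketched as follows; predicting \(\log(u)\) to keep the uncertainty positive is a common parameterization that we assume here, not necessarily the authors' exact choice.

```python
import math
import torch

def uncertainty_depth_loss(d_pred: torch.Tensor,
                           d_gt: torch.Tensor,
                           log_u: torch.Tensor) -> torch.Tensor:
    """Uncertainty-aware depth loss: sqrt(2)/u * |d - d*| + log(u).

    The same form is applied to the visual, attribute, and instance depth terms.
    """
    u = log_u.exp()
    return (math.sqrt(2.0) / u * (d_pred - d_gt).abs() + log_u).mean()
```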
Table 1. Comparisons on the KITTI testing set. The highest and second-highest results are highlighted. Our method achieves state-of-the-art results. Note that DD3D [33] uses a large private dataset (DDAD15M), which includes 15M frames.

5 Experiments

5.1 Implementation Details

We conduct experiments on 2 NVIDIA RTX 3080Ti GPUs with a batch size of 16, using the PyTorch framework [34]. We train the network for 200 epochs and employ the Hierarchical Task Learning (HTL) [25] training strategy. The Adam optimizer is used with an initial learning rate of \(1\times 10^{-5}\). The learning rate increases to \(1\times 10^{-3}\) over the first 5 epochs with a linear warm-up strategy and decays at epochs 90 and 120 with a rate of 0.1. We set k to 12 in the multi-bin orientation \(\varTheta \). For the backbone and heads, we follow the design in [25, 54]. Inspired by CaDDN [39], we project LiDAR point clouds onto the image frame to create sparse depth maps and then perform depth completion [15] to generate a depth value at each pixel. Due to space limitations, we provide more experimental results and discussion in the supplementary material.
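One possible realization of the described optimizer and learning-rate schedule is sketched below; the placeholder model and per-epoch stepping are assumptions.

```python
import torch

# Sketch: linear warm-up from 1e-5 to 1e-3 over the first 5 epochs,
# then decay by 0.1 at epochs 90 and 120.
model = torch.nn.Linear(1, 1)  # placeholder for the detector

def lr_lambda(epoch: int) -> float:
    if epoch < 5:                               # linear warm-up: 1e-5 -> 1e-3
        return 1e-2 + (1.0 - 1e-2) * epoch / 5
    factor = 1.0
    if epoch >= 90:                             # first decay
        factor *= 0.1
    if epoch >= 120:                            # second decay
        factor *= 0.1
    return factor

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(200):
    # ... one training epoch over the KITTI training split ...
    scheduler.step()
```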

5.2 Dataset and Metrics

Following the commonly adopted setup in previous works [2, 12, 18, 24], we perform experiments on the widely used KITTI [14] 3D detection dataset. The KITTI dataset provides 7,481 training samples and 7,518 testing samples; the training labels are publicly available, while the labels of the testing samples are withheld on the KITTI website and used only for online evaluation and ranking. For ablations, previous work further divides the 7,481 samples into a new training set with 3,712 samples and a val set with 3,769 samples. This data split [7] is widely adopted by most previous works. Additionally, KITTI divides objects into easy, moderate, and hard levels according to the object’s 2D box height (related to the depth) and its occlusion and truncation levels. For evaluation metrics, we use the suggested \(AP_{40}\) metric [45] on the two core tasks, i.e., 3D and bird’s-eye-view (BEV) detection.

Fig. 5. Qualitative results on the KITTI val set, showing ground-truth 3D boxes and our predictions. We can observe that the model produces accurate 3D box predictions. Best viewed in color with zoom in.

5.3 Performance on KITTI Benchmark

We compare our method (DID-M3D) with other methods on the KITTI test set, the official benchmark for monocular 3D object detection. The results are shown in Table 1. Our method achieves a new state of the art. For example, compared to GUPNet [25] (ICCV 21), we boost the performance from 21.19/15.02 to 22.26/16.29 under the moderate setting. Compared to PCT [47] (NeurIPS 21), we exceed it by 3.23/2.92 AP under the moderate setting, which is a significant improvement. Compared to the recent method MonoCon [22] (AAAI 22), our method still shows better performance on all BEV metrics and one 3D metric. Also, the runtime of our method is comparable to other real-time methods. Such results validate the superiority of our method. Additionally, to demonstrate the generalizability to other categories, we perform experiments on the cyclist and pedestrian categories. As shown in Table 5, our method brings obvious improvements over the baseline (which does not employ the proposed components). The results suggest that our method also works well for these categories.

Moreover, we provide qualitative results on the RGB images and in 3D space in Fig. 5. For most simple cases (e.g., close objects without occlusion or truncation), the model predictions are quite precise. However, for heavily occluded, truncated, or faraway objects, the orientation or instance depth is less accurate. This is a common dilemma for most monocular works due to the limited information in monocular imagery. In the supplementary material, we provide more experimental results and a detailed discussion of failure cases.

5.4 Ablation Study

To investigate the impact of each component in our method, we conduct detailed ablation studies on KITTI. Following common practice in previous works, all ablation studies are evaluated on the val set for the car category.

Decoupled Instance Depth. We report the results in Table 2. Experiment (a) is the baseline using direct instance depth prediction. For a fair comparison, the baseline also employs the grid design (Experiment (b)): similar to our method, the network produces \(7 \times 7\) instance depth predictions for every object, which are all supervised during training and averaged at inference. Then, we decouple the instance depth into visual depths and attribute depths (Experiment (b) \(\rightarrow \) (c)); this simple modification improves the accuracy significantly. This result indicates that the network performs suboptimally due to the coupled nature of instance depth, supporting our viewpoint. From Experiment (c) \(\rightarrow \) (d, e), we can see that the depth uncertainty brings improvements, because the uncertainty stabilizes the depth training and benefits network learning. When both types of uncertainty are enforced simultaneously, the performance is further boosted. Please note that the decoupled instance depth is a precondition of the decoupled uncertainty. Given the two types of depth uncertainty, we can obtain the final instance depth uncertainty (Experiment (f) \(\rightarrow \) (g)), which can be regarded as the 3D localization confidence. It is combined with the original 2D detection confidence, which brings obvious improvements. Finally, we use the decoupled depths and corresponding uncertainties to adaptively obtain the final instance depth (Experiment (h)), whereas the previous experiments use the average value over the patch. This design further enhances the performance. In summary, the decoupled-depth strategy improves the baseline from 16.79/11.24 to 22.76/16.12 AP (Experiment (b) \(\rightarrow \) (h)), an impressive result. Overall, the ablations validate the effectiveness of our method.

Table 2. Ablation for decoupled instance depth. “Dec.”: decoupled; “ID.”: instance depth; “\(u_{vis}\)”: visual depth uncertainty; “\(u_{att}\)”: attribute depth uncertainty; “Conf.”: confidence; “AA.”: adaptive aggregation.

Affine Transformation Based Data Augmentation. Here we study the effect of affine-transformation-based data augmentation. The comparisons are shown in Table 3. The method clearly benefits from affine-based data augmentation. Note that the proper depth transformation is crucial: when applying affine-based augmentation, the visual depth should be scaled while the attribute depth should not be changed, due to their affine-sensitive and affine-invariant natures, respectively. If we change the attribute depth without scaling the visual depth, the detector even performs worse than the one without affine-based augmentation (\(AP_{3D}\) drops from 12.76 to 12.65), because the incorrect depth targets mislead the network during training. After correcting the visual depth, the network benefits from the augmented training samples, boosting the performance from 19.05/12.76 to 21.74/15.48 AP under the moderate setting. An improper visual depth has a larger impact on the final performance than an improper attribute depth, as the visual depth has a larger value range. Finally, we obtain the best performance when employing the proper visual and attribute depth transformation strategy.

Table 3. Ablation for affine transformation based data augmentation. “Aff. Aug.” in the table denotes the affine-based data augmentation.

Grid Size for Visual and Attribute Depth. As described in Sect. 4, we divide the RoI image patch into \(m \times m\) grids, where each grid has a visual depth and an attribute depth. Here we investigate the impact of the grid size. As the grid size m increases, the visual and attribute depths become more fine-grained. This makes the visual depth more intuitive, closer to the pixel-wise depth. However, a fine-grained grid leads to suboptimal performance in learning object attributes, since the attributes concern the object as a whole. This indicates a trade-off. Therefore, we perform ablations on the grid size m, as shown in Table 4. We achieve the best performance when m is set to 7.

Table 4. Ablation for the grid size on visual depth and attribute depth.
Table 5. Comparisons on pedestrian and cyclist categories on KITTI val set under IoU criterion 0.5. Our method brings obvious improvements to the baseline.

6 Conclusion

In this paper, we point out that the instance depth is a coupling of visual depth clues and object inherent attributes. This entangled nature makes it hard to estimate precisely with the previous direct methods. Therefore, we propose to decouple the instance depth into visual depths and attribute depths, which allows the network to learn different types of features for the instance depth. At inference, the instance depth is obtained by aggregating the visual depths, attribute depths, and associated uncertainties. Using the decoupled depth, we can effectively perform affine-transformation-based data augmentation on the image, which is usually limited in previous works. Finally, extensive experiments demonstrate the effectiveness of our method.