
1 Introduction

Depth completion, the technique of converting sparse depth measurements into dense ones, has a variety of applications in the computer vision field, such as autonomous driving [7, 14, 50], augmented reality [8, 45], virtual reality [1], and 3D scene reconstruction [36, 42, 43, 57]. The success of these applications heavily depends on reliable depth predictions. Recently, multi-modal information from various sensors has been incorporated to help generate dependable depth results, such as color images [3, 33], surface normals [38, 57], confidence maps [10, 49], and even binaural echoes [12, 35]. In particular, the latest image guided methods [17, 29, 47, 59] principally concentrate on using color images to guide the recovery of dense depth maps, achieving outstanding performance. However, due to challenging environments and limited depth measurements, it is difficult for existing image guided methods to produce clear image guidance and structure-detailed depth features (see Figs. 2 and 6). To deal with these issues, in this paper we develop a repetitive design in both the image guidance branch and the depth generation branch.

Fig. 1.

To obtain dense depth predictions, most existing image guided methods employ tandem models [4, 33, 36] (a) or parallel models [17, 29, 47, 59] (b, c) with various inputs (e.g., boundary/confidence/normal/RGB-D), whereas we propose the repetitive mechanism (d), aiming to provide gradually refined image/depth guidance.

In the image guidance branch: Existing image guided methods do not produce sufficiently precise details to provide perspicuous image guidance, which limits content-complete depth recovery. For example, the tandem models (Fig. 1(a)) tend to utilize only the final layer features of an hourglass unit. The parallel models conduct scarce interaction between multiple hourglass units (Fig. 1(b)), or refer to image guidance encoded by only a single hourglass unit (Fig. 1(c)). Different from them, as shown in Fig. 1(d), we present a vertically repetitive hourglass network to make good use of RGB features in multi-scale layers, which contain image semantics with much clearer and richer contexts.

In the depth generation branch: It is known that gradients near boundaries usually exhibit large mutations, which increase the difficulty of recovering structure-detailed depth for convolution [48]. As evidenced in plenty of methods [10, 18, 36], depth values are usually hard to predict, especially around regions with unclear boundaries. To mitigate this issue, in this paper we propose a repetitive guidance module based on dynamic convolution [47]. It first extracts the high-frequency components by channel-wise and cross-channel convolution factorization, and then repeatedly stacks the guidance unit to progressively produce refined depth. We also design an adaptive fusion mechanism to obtain better depth representations by aggregating the depth features of each repetitive unit. However, an obvious drawback of dynamic convolution is its large GPU memory consumption, especially in the case of our repetitive structure. Hence, we further introduce an efficient module that largely reduces the memory cost while maintaining accuracy.

Benefiting from the repetitive strategy with gradually refined image/depth representations, our method performs better than others, as shown in Figs. 4, 5 and 6, and reported in Tables 3, 4, 5 and 6. In short, our contributions are:

  • We propose an effective but lightweight repetitive hourglass network, which can extract legible image features of challenging environments to provide clearer guidance for depth recovery.

  • We present the repetitive guidance module based on dynamic convolution, including an adaptive fusion mechanism and an efficient guidance algorithm, which can gradually learn precise depth representations.

  • Extensive experimental results demonstrate the effectiveness of our method, which achieves outstanding performance on three datasets.

2 Related Work

Depth-only Approaches. In 2017, the work [48] first proposed sparsity invariant CNNs to deal with sparse depth. Since then, many depth completion works [6, 10, 22, 24, 33, 48, 49] have taken depth as input without using color images. Distinctively, Lu et al. [32] take sparse depth as the only input, with color images serving as auxiliary supervision during training. However, single-modal methods are limited without other reference information. As technology develops, plenty of multi-modal information becomes available, e.g., surface normals and optical flow images, which can significantly facilitate the depth completion task.

Image Guided Methods. Existing image guided depth completion methods can be roughly divided into two patterns. In one pattern, various maps are fed together into tandem hourglass networks [3,4,5, 33, 36, 52]. For example, S2D [33] directly feeds the concatenation into a simple Unet [41]. CSPN [5] studies the affinity matrix to refine coarse depth maps with a spatial propagation network (SPN). CSPN++ [4] further improves its effectiveness and efficiency by learning adaptive convolutional kernel sizes and the number of iterations for propagation. As an extension, NLSPN [36] presents a non-local SPN which focuses on relevant non-local neighbors during propagation. The other pattern uses multiple independent branches to model different sensor information and then fuses them at multi-scale stages [17, 26, 29, 47, 49, 53]. For example, PENet [17] employs feature addition to guide depth learning at different stages. ACMNet [59] chooses graph propagation to capture the observed spatial contexts. GuideNet [47] seeks to predict dynamic kernel weights from the guided image and then adaptively extract depth features. However, these methods still cannot provide sufficiently rich semantic guidance for the specific depth completion task.

Repetitive Learning Models. To extract more accurate and abundant feature representations, many approaches [2, 31, 37, 40] repeatedly stack similar components. For example, PANet [30] adds an extra bottom-up path aggregation which is similar to its preceding top-down feature pyramid network (FPN). NAS-FPN [13] and BiFPN [46] conduct repetitive blocks to sufficiently encode discriminative image semantics for object detection. FCFRNet [29] argues that the feature extraction in one-stage frameworks is insufficient, and thus proposes a two-stage model, which can be regarded as a special case of the repetitive design. On this basis, PENet [17] further improves performance by utilizing confidence maps and a varietal CSPN++. Different from these methods, in our image branch we first conduct repetitive CNN units to produce clearer guidance in multi-scale layers. Then in our depth branch we perform a repetitive guidance module to generate structure-detailed depth.

Fig. 2.

Overview of our repetitive image guided network, which contains an image guidance branch and a depth generation branch. The former consists of a repetitive hourglass network (RHN) and the latter has a similar structure to RHN\(_1\). In the depth branch, we perform our novel repetitive guidance module (RG, elaborated in Fig. 3) to refine depth. In addition, an efficient guidance algorithm (EG) and an adaptive fusion mechanism (AF) are proposed to further improve the performance of the module.

3 Repetitive Design

In this section, we first introduce our repetitive hourglass network (RHN), and then elaborate on the proposed repetitive guidance module (RG), including an efficient guidance algorithm (EG) and an adaptive fusion mechanism (AF).

3.1 Repetitive Hourglass Network

For autonomous driving in challenging environments, it is important to understand the semantics of color images given the sparse depth measurements. The problem of blurry image guidance can be mitigated by a powerful feature extractor that obtains context-clear semantics. In this paper we present our repetitive hourglass network, shown in Fig. 2. RHN\(_i\) is a symmetrical hourglass unit like Unet. The original color image is first encoded by a \(5\times 5\) convolution and then input into RHN\(_1\). Next, we repeatedly apply a similar but more lightweight unit, each layer of which consists of two convolutions, to gradually extract high-level semantics. In the encoder of RHN\(_i\), \(E_{ij}\) takes \(E_{i(j-1)}\) and \(D_{(i-1)j}\) as input. In the decoder of RHN\(_i\), \(D_{ij}\) takes \(E_{ij}\) and \(D_{i(j+1)}\) as input. When \(i>1\), the process is

$$\begin{aligned} E_{ij}&=\begin{cases} Conv\left( D_{(i-1)j} \right), & j=1, \\ Conv\left( E_{i(j-1)} \right) + D_{(i-1)j}, & 1<j\le 5, \end{cases} \\ D_{ij}&=\begin{cases} Conv\left( E_{i5} \right), & j=5, \\ Deconv\left( D_{i(j+1)} \right) + E_{ij}, & 1\le j<5, \end{cases} \end{aligned}$$
(1)

where \({Deconv}\left( \cdot \right) \) denotes the deconvolution function, and \(E_{1j}=Conv(E_{1(j-1)})\).
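To make the recurrence in Eq. (1) concrete, a minimal PyTorch sketch of one lightweight RHN\(_i\) (\(i>1\)) is given below. The class name `LightweightRHN`, the constant channel width, the stride pattern, and the activation placement are assumptions for illustration; this is not the authors' implementation, and RHN\(_1\) (ResNet-based) is omitted.

```python
import torch
import torch.nn as nn

class LightweightRHN(nn.Module):
    """Sketch of one lightweight RHN_i (i > 1) following Eq. (1).

    Assumes 5 encoder/decoder scales, a constant channel width, and that the
    previous unit's decoder features D_{(i-1)j} are passed in as a list
    ordered from full resolution (j = 1) to coarsest (j = 5).
    """

    def __init__(self, channels=64, num_scales=5):
        super().__init__()
        # each encoder layer: two convolutions, downsampling by stride 2 (except the first)
        self.enc = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, stride=1 if j == 0 else 2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True))
            for j in range(num_scales)])
        # bottleneck convolution for D_{i5} = Conv(E_{i5})
        self.mid = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        # decoder layers: deconvolution upsampling by stride 2
        self.dec = nn.ModuleList([
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
            for _ in range(num_scales - 1)])

    def forward(self, prev_dec):          # prev_dec[j] = D_{(i-1)(j+1)}, j = 0..4
        E = []
        for j in range(5):                # encoder, Eq. (1) top
            if j == 0:
                E.append(self.enc[j](prev_dec[j]))
            else:
                E.append(self.enc[j](E[j - 1]) + prev_dec[j])
        D = [None] * 5
        D[4] = self.mid(E[4])             # decoder, Eq. (1) bottom (coarsest scale)
        for j in range(3, -1, -1):
            D[j] = self.dec[j](D[j + 1]) + E[j]
        return D                          # passed on to RHN_{i+1} / the depth branch
```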

Fig. 3.

Our repetitive guidance module (RG), implemented with an efficient guidance algorithm (EG) and an adaptive fusion mechanism (AF). k refers to the repetition number.

3.2 Repetitive Guidance Module

Depth in challenging environments is not only extremely sparse but also diverse. Most existing methods suffer from unclear structures, especially near object boundaries. Since gradual refinement has been proven effective [4, 36, 52] for tackling this issue, we propose our repetitive guidance module to progressively generate dense and structure-detailed depth maps. As illustrated in Fig. 2, our depth generation branch has the same architecture as RHN\(_1\). Given the sparse depth input and the color image guidance features \(D_{ij}\) in the decoder of the last RHN, our depth branch generates the final dense predictions. In the encoder of the depth branch, our repetitive guidance module (left of Fig. 3) takes \(D_{ij}\) and \(e_{1j}\) as input and employs the efficient guidance algorithm (Sect. 3.2) to produce refined depth \(d_{jk}\) step by step. Then we fuse the refined \(d_{jk}\) by our adaptive fusion mechanism (Sect. 3.2), obtaining the depth \(d_j\),

$$\begin{aligned} {{d}_{j}}=RG\left( {{D}_{ij}},{{e}_{1j}} \right) , \end{aligned}$$
(2)

where \({RG}\left( \cdot \right) \) refers to the repetitive guidance function.

Efficient Guidance Algorithm. Suppose the sizes of the inputs \(D_{ij}\) and \(e_{1j}\) are both \(C\times H\times W\). The complexity of the dynamic convolution, which generates spatially-variant kernels according to color image features, is \(O(C\times C\times {{R}^{2}}\times H\times W)\), where \(R^2\) is the size of the filter kernel window. In practice, C, H, and W are usually very large; it is thus necessary to reduce the complexity of the dynamic convolution. GuideNet [47] proposes channel-wise and cross-channel convolution factorization, whose complexity is \(O(C\times {R^2}\times H\times W + C\times C)\). However, our repetitive guidance module employs the convolution factorization many times, and the channel-wise step still requires massive GPU memory, namely \(O(C\times {R^2}\times H\times W)\). Therefore, inspired by SENet [16], which captures high-frequency responses with channel-wise differentiable operations, we design an efficient guidance unit that simultaneously reduces the complexity of the channel-wise convolution and encodes high-frequency components, as shown in the top right of Fig. 3. Specifically, we first concatenate the image and depth inputs and then conduct a \(3\times 3\) convolution. Next, we employ the global average pooling function to generate a \(C\times 1\times 1\) feature. At last, we perform a pixel-wise dot product between this feature and the depth input. The complexity of our channel-wise convolution is only \(O(C\times H \times W)\), reduced by a factor of \(R^{2}\). The process is defined as

$$\begin{aligned} {{d}_{jk}}=\begin{cases} EG\left( {{D}_{ij}},{{e}_{1j}} \right), & k=1, \\ EG\left( Conv\left( {{D}_{ij}} \right) ,{{d}_{j(k-1)}} \right), & k>1, \end{cases} \end{aligned}$$
(3)

where \({EG}\left( \cdot \right) \) represents the efficient guidance function.
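A minimal sketch of the EG unit and the repetition in Eq. (3) could look like the following. The channel widths, the repetition number k, and the placement of normalization/activation layers are assumptions, and the adaptive fusion step is deferred to the next subsection; this is an illustrative sketch rather than the authors' code.

```python
import torch
import torch.nn as nn

class EfficientGuidance(nn.Module):
    """EG unit: concat -> 3x3 conv -> global average pooling -> channel-wise dot."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)           # produces a C x 1 x 1 response

    def forward(self, img_feat, depth_feat):
        w = self.gap(self.conv(torch.cat([img_feat, depth_feat], dim=1)))
        return w * depth_feat                         # pixel-wise (channel-wise) dot product


class RepetitiveGuidance(nn.Module):
    """RG module following Eq. (3): apply EG k times (AF fusion omitted here)."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.eg = nn.ModuleList([EfficientGuidance(channels) for _ in range(k)])
        self.img_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(k - 1)])

    def forward(self, D_ij, e_1j):
        outs = [self.eg[0](D_ij, e_1j)]               # k = 1
        for n in range(1, self.k):                    # k > 1
            outs.append(self.eg[n](self.img_conv[n - 1](D_ij), outs[-1]))
        return outs                                   # d_{j1}, ..., d_{jk} for the AF step
```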

Suppose the memory consumptions of the common dynamic convolution, convolution factorization, and our EG are \(M_{DC}\), \(M_{CF}\), and \(M_{EG}\), respectively.

Table 1. Theoretical analysis on GPU memory consumption.
Table 2. Numerical analysis on GPU memory consumption.

Table 1 shows the theoretical analysis of the GPU memory consumption ratios. Under the setting of the second fusion stage (out of 4 in total) in our depth generation branch, using 4-byte floating precision and taking \(C=128\), \(H=128\), \(W=608\), and \(R=3\), as shown in Table 2, the GPU memory of EG is reduced from 42.75 GB to 0.037 GB compared with the common dynamic convolution, nearly 1155 times lower in one fusion stage. Compared to the convolution factorization in GuideNet [47], the memory of EG is reduced from 0.334 GB to 0.037 GB, nearly 9 times lower. Therefore, we can apply our repetitive strategy easily without worrying much about GPU memory consumption.
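The numbers in Table 2 can be reproduced with a back-of-the-envelope calculation over the complexity terms above (4-byte floats, one fusion stage, batch size 1 assumed); this is only a sanity check, not part of the model.

```python
# Memory of the spatially-variant kernels under 4-byte floating precision.
C, H, W, R = 128, 128, 608, 3
bytes_per_float = 4
GB = 1024 ** 3

M_DC = C * C * R**2 * H * W * bytes_per_float / GB       # common dynamic convolution
M_CF = (C * R**2 * H * W + C * C) * bytes_per_float / GB  # convolution factorization [47]
M_EG = C * H * W * bytes_per_float / GB                   # our efficient guidance

print(f"M_DC = {M_DC:.2f} GB")   # ~42.75 GB
print(f"M_CF = {M_CF:.3f} GB")   # ~0.334 GB
print(f"M_EG = {M_EG:.3f} GB")   # ~0.037 GB
# ratio M_DC / M_EG = C * R^2 = 1152 (reported as nearly 1155x using the rounded GB values)
print(f"ratios: {M_DC / M_EG:.0f}x, {M_CF / M_EG:.1f}x")
```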

Adaptive Fusion Mechanism. Since many coarse depth features (\(d_{j1}\), \(\cdots \), \(d_{jk}\)) are available in our repetitive guidance module, it is natural to jointly utilize them to generate refined depth maps, which has been proven effective in various related methods [4, 17, 28, 36, 45, 58]. Inspired by the selective kernel convolution in SKNet [27], we propose the adaptive fusion mechanism to refine depth, illustrated in the bottom right of Fig. 3. Specifically, given inputs \((d_{j1}, \cdots , d_{jk})\), we first concatenate them and then perform a \(3\times 3\) convolution. Next, global average pooling is employed to produce a \(C\times 1\times 1\) feature map. Then another \(3\times 3\) convolution and a softmax function are applied, obtaining \((\alpha _{1},\cdots ,\alpha _{k})\),

$$\begin{aligned} {{\alpha }_{k}}=Soft\left( Conv\left( GAP\left( Conv\left( {{d}_{j1}}|| \cdots ||{{d}_{jk}} \right) \right) \right) \right) , \end{aligned}$$
(4)

where \(Soft\left( \cdot \right) \) and || refer to softmax function and concatenation. \(GAP\left( \cdot \right) \) represents the global average pooling operation. Finally, we fuse the k coarse depth maps using \(\alpha _{k}\) to produce the output \(d_j\),

$$\begin{aligned} {{d}_{j}}=\sum \nolimits _{n=1}^{k}{{{\alpha }_{n}}{{d}_{jn}}}. \end{aligned}$$
(5)

Equations 4 and 5 can be jointly denoted as

$$\begin{aligned} {{d}_{j}}=AF\left( {d}_{j1},{d}_{j2},\cdots ,{d}_{jk} \right) , \end{aligned}$$
(6)

where \(AF\left( \cdot \right) \) represents the adaptive fusion function.
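Under the same assumptions as the earlier sketches, the AF mechanism of Eqs. (4)-(6) could be written as below. Taking the softmax over the k branches per channel (SKNet-style) is one plausible reading of Eq. (4) and is an assumption, as are the channel sizes.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """AF mechanism, Eqs. (4)-(6): learn softmax weights over the k refined depths."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.conv1 = nn.Conv2d(k * channels, channels, 3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv2 = nn.Conv2d(channels, k * channels, 3, padding=1)

    def forward(self, d_list):                        # [d_j1, ..., d_jk], each B x C x H x W
        B, C, _, _ = d_list[0].shape
        s = self.conv2(self.gap(self.conv1(torch.cat(d_list, dim=1))))
        alpha = torch.softmax(s.view(B, self.k, C, 1, 1), dim=1)    # Eq. (4)
        d_stack = torch.stack(d_list, dim=1)                         # B x k x C x H x W
        return (alpha * d_stack).sum(dim=1)                          # Eq. (5), output d_j
```

In practice this module would consume the list returned by the `RepetitiveGuidance` sketch above, which matches the "about 0-0.06 GB" extra memory reported for AF in the ablations since only the small weighting branch is added.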

4 RigNet

In this section, we describe the network architecture and the loss function for training. The proposed RigNet mainly consists of two parts: (1) an image guidance branch for the generation of hierarchical and clear semantics based on the repetitive hourglass network, and (2) a depth generation branch for structure-detailed depth predictions based on the novel repetitive guidance module with an efficient guidance algorithm and an adaptive fusion mechanism.

4.1 Network Architecture

Figure 2 shows the overview of our network. In the image guidance branch, the RHN\(_1\) encoder-decoder unit is built upon residual networks [15]. In addition, we adopt the common skip-connection strategy [3, 41] to simultaneously utilize low-level and high-level features. RHN\(_i\) (\(i>1\)) has a similar but more lightweight architecture than RHN\(_1\), which is used to extract clearer image guidance semantics [54].

The depth generation branch has the same structure as RHN\(_1\). In this branch, we apply the repetitive guidance module based on dynamic convolution to gradually produce structure-detailed depth features at multiple stages, as shown in Fig. 3 and described in Sect. 3.2.

4.2 Loss Function

During training, we adopt the mean squared error (MSE) to compute the loss, which is defined as

$$\begin{aligned} \mathcal {L}=\frac{1}{m}\sum \limits _{q\in {{Q}_{v}}}{\left\| GT_{q}-{P}_{q} \right\| }^{2}, \end{aligned}$$
(7)

where GT and P refer to the ground truth depth and the predicted depth, respectively. \(Q_v\) represents the set of valid pixels in GT, and m is the number of valid pixels.
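Equation (7) is an MSE evaluated only on valid ground-truth pixels. A minimal sketch, assuming invalid pixels are encoded as zero depth (as in the KITTI annotations):

```python
import torch

def masked_mse_loss(pred, gt):
    """MSE over valid ground-truth pixels only (Eq. 7).

    Assumes invalid pixels in `gt` are marked by a depth value of 0.
    """
    valid = gt > 0
    return ((gt[valid] - pred[valid]) ** 2).mean()
```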

5 Experiments

In this section, we first introduce the related datasets, metrics, and implementation details. Then, we carry out extensive experiments to evaluate the performance of our method against other state-of-the-art approaches. Finally, a number of ablation studies are employed to verify the effectiveness of our method.

5.1 Datasets and Metrics

KITTI Depth Completion Dataset. [48] is a large real-world autonomous driving benchmark collected from a driving vehicle. It consists of 86,898 frames with ground truth annotations, aligned sparse LiDAR maps, and color images for training, 7,000 frames for validation, and another 1,000 frames for testing. The official 1,000 validation images are used for validation during training, while the remaining validation images are ignored. Since there are few LiDAR points at the top of the depth maps, the input images are bottom-center cropped [29, 47, 49, 59] to \(1216\times 256\).

Virtual KITTI Dataset. [11] is a synthetic dataset cloned from the real-world KITTI video sequences. In addition, it provides color images under various lighting (e.g., sunset, morning) and weather (e.g., rain, fog) conditions. Following GuideNet [47], we use the masks generated from the sparse depths of the KITTI dataset to obtain sparse samples. This strategy keeps the distribution of sparse depths close to the real-world situation. Sequences 0001, 0002, 0006, and 0018 are used for training, and sequence 0020 with various lighting and weather conditions is used for testing. This yields 1,289 frames for fine-tuning and 837 frames for evaluating each condition.

NYUv2 Dataset. [44] comprises video sequences of a variety of indoor scenes recorded by the color and depth cameras of the Microsoft Kinect. Paired color images and depth maps from 464 indoor scenes are commonly used. Following previous depth completion methods [3, 33, 36, 38, 47], we train our model on 50K images from the official training split and test on the 654 images from the official labeled test set. Each image is downsized to \(320\times 240\), and then \(304\times 228\) center-cropping is applied. As the input resolution of our network must be a multiple of 32, we further pad the images to \(320\times 256\), but evaluate only on the valid region of size \(304\times 228\) to keep the comparison with other methods fair.
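The resize-crop-pad pipeline above could be sketched as follows for the color images. The function name `preprocess_nyu`, the bilinear interpolation mode, and the symmetric zero padding are assumptions; sparse depth maps would typically need a validity-aware (e.g., nearest-neighbor) resize instead.

```python
import torch
import torch.nn.functional as F

def preprocess_nyu(img):                   # img: B x C x 480 x 640 color tensor
    # downsize to 320 x 240
    img = F.interpolate(img, size=(240, 320), mode='bilinear', align_corners=False)
    # center crop to 304 x 228
    top, left = (240 - 228) // 2, (320 - 304) // 2
    img = img[:, :, top:top + 228, left:left + 304]
    # pad back to 320 x 256 (multiple of 32); evaluation uses only the valid 304 x 228 region
    return F.pad(img, ((320 - 304) // 2, (320 - 304) // 2,    # left, right
                       (256 - 228) // 2, (256 - 228) // 2))   # top, bottom
```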

Metrics. For the outdoor KITTI depth completion dataset, following the KITTI benchmark and existing methods [17, 29, 36, 47], we use four standard metrics for evaluation, including RMSE, MAE, iRMSE, and iMAE. For the indoor NYUv2 dataset, following previous works [3, 29, 36, 38, 47], three metrics are selected for evaluation, including RMSE, REL, and \({{\delta }_{i}}\) (\(i=1.25, 1.25^2, 1.25^3\)).

5.2 Implementation Details

The model is trained on 4 TITAN RTX GPUs. We train it for 20 epochs with the loss defined in Eq. 7. We use ADAM [23] as the optimizer with momentum parameters \(\beta _{1}=0.9\) and \(\beta _{2}=0.999\), a starting learning rate of \(1 \times {10}^{-3}\), and a weight decay of \(1 \times {10}^{-6}\). The learning rate drops by half every 5 epochs. Synchronized cross-GPU batch normalization [21, 55] is used during training.
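These settings map onto standard PyTorch components as in the sketch below; `model` and `train_loader` are hypothetical placeholders, `masked_mse_loss` refers to the sketch in Sect. 4.2, and the multi-GPU/sync-BN wrapping is omitted.

```python
import torch

def train(model, train_loader, epochs=20):
    """Training loop sketch matching the settings above (hypothetical model/loader)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=1e-6)
    # halve the learning rate every 5 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
    for _ in range(epochs):
        for sparse_depth, rgb, gt in train_loader:
            optimizer.zero_grad()
            pred = model(sparse_depth, rgb)
            loss = masked_mse_loss(pred, gt)   # Eq. (7)
            loss.backward()
            optimizer.step()
        scheduler.step()
```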

Table 3. Quantitative comparisons on KITTI depth completion benchmark.
Table 4. Quantitative comparisons on NYUv2 dataset.

5.3 Evaluation on KITTI Dataset

Table 3 shows the quantitative results on the KITTI benchmark, whose dominant evaluation metric is RMSE. Our RigNet ranks 1st among published papers at the time of submission, outperforming the 2nd-best method by a significant 17.42 mm while the errors of the other methods are very close to one another. The performance of our RigNet is also better than approaches that employ additional datasets, e.g., DLiDAR [38], which utilizes CARLA [9] to predict surface normals for better depth predictions. Qualitative comparisons with several state-of-the-art works are shown in Fig. 4. While all methods provide visually good results in general, our estimated depth maps possess more details and more accurate object boundaries. The corresponding error maps offer clearer support. For example, among the marked iron pillars in the first row of Fig. 4, the error of our prediction is significantly lower than that of the others.

5.4 Evaluation on NYUv2 Dataset

To verify the performance of the proposed method on indoor scenes, following existing approaches [4, 29, 36, 47], we train our repetitive image guided network on the NYUv2 dataset [44] with 500 sparse samples. As illustrated in Table 4, our model achieves the best performance among all traditional and recent approaches without using additional datasets, which shows that our network possesses stronger generalization capability. Figure 5 presents the qualitative results. Compared with those state-of-the-art methods, our RigNet recovers more detailed structures with lower errors at most pixels, including sharper boundaries and more complete object shapes. For example, among the marked doors in the last row of Fig. 5, our prediction is very close to the ground truth, while the others either have large errors over whole regions or blurry shapes on specific objects.

Fig. 4.

Qualitative results on the KITTI depth completion test set, including (b) GuideNet [47], (c) FCFRNet [29], and (d) CSPN [5]. Given sparse depth maps and the aligned color images (1st column), depth completion models output dense depth predictions (e.g., 2nd column). We provide error maps borrowed from the KITTI leaderboard for detailed discrimination. Warmer colors in the error maps refer to higher errors.

Fig. 5.

Qualitative results on the NYUv2 test set. From left to right: (a) color image, (b) sparse depth, (c) NLSPN [36], (d) ACMNet [59], (e) CSPN [5], (f) our RigNet, and (g) ground truth. We present the results of these four methods under 500 samples. The circled rectangular areas show the recovery of object details.

Table 5. Ablation studies of RHN on the KITTI validation set. The baseline uses a single ResNet-18 as backbone. 'Deeper'/'More' denotes that we conduct single & deeper / multiple & tandem hourglass units as backbone. Note that each layer of RHN\(_{2,3}\) only contains two convolutions while RHN\(_{1}\) employs ResNet.

5.5 Ablation Studies

Here we conduct extensive experiments to verify the effectiveness of each proposed component, including the repetitive hourglass network (RHN, Table 5) and the repetitive guidance module (RG, Table 6), which consists of the efficient guidance algorithm (EG) and the adaptive fusion mechanism (AF). Note that the batch size is set to 8 when computing GPU memory consumption.

(1) Effect of Repetitive Hourglass Network

The state-of-the-art baseline GuideNet [47] employs a single ResNet-18 as backbone and the guided convolution G\(_1\) to predict dense depth. To validate the effect of our RHN, we explore the backbone design of the image guidance branch for the specific depth completion task from four aspects, as illustrated in Table 5.

(i) Deeper single backbone vs. RHN. The second column of Table 5 shows that, when replacing the single ResNet-10 with ResNet-18, the error is reduced by 43 mm. However, when deepening the baseline from 18 to 26/34/50, the errors barely change, which indicates that simply increasing the network depth of the image guidance branch cannot deal well with the specific depth completion task. In contrast, with little sacrifice in parameters (\(\sim \)2 M), our RHN-10-3 and RHN-18-3 are 24 mm and 10 mm superior to Deeper-10-1 and Deeper-18-1, respectively. Figure 6 shows that the image features of our parallel RHN-18-3 have much clearer and richer contexts than those of the baseline Deeper-18-1.

(ii) More tandem backbones vs. RHN. As shown in the third column of Table 5, we stack the hourglass unit in series. The models More-18-2, More-18-3, and More-18-4 perform worse than the baseline Deeper-18-1. It turns out that the combination of tandem hourglass units is not sufficient to provide clearer image semantic guidance for depth recovery. In contrast, our parallel RHN achieves better results with fewer parameters and smaller model sizes. These facts give strong evidence that the parallel repetitive design in the image guidance branch is effective for the depth completion task.

(iii) Deeper-More backbones vs. RHN. As illustrated in the fourth column of Table 5, deeper hourglass units are deployed in a serial way. We can see that the Deeper-More combinations are also not very effective, since their errors are higher than the baseline's while RHN's error is 10 mm lower. This again verifies the effectiveness of the lightweight RHN design.

(2) Effect of Repetitive Guidance Module

(i) Efficient guidance. Note that we directly output the features of EG\(_{3}\) when not employing AF. Tables 1 and 2 have provided quantitative theoretical analysis for the EG design. Based on (a), we disable G\(_1\) by replacing it with EG\(_{1}\). Comparing (b) with (a) in Table 6, both of which carry out the guided convolution technology only once, although the error of (b) goes down only a little bit, the GPU memory is heavily reduced by 11.95 GB. These results give strong evidence that our new guidance design is not only effective but also efficient.

Table 6. Ablation studies of RG/AF on the KITTI validation set. RG-EG\(_k\) refers to the case where we repeatedly use EG k times. '\(\pm 0\)' refers to 23.37 GB. G\(_1\) represents the raw guided convolution in GuideNet [47], which is used only once in one fusion stage.
Fig. 6.

Visual comparisons of the intermediate features of the baseline and our repetitive design.

(ii) Repetitive guidance. When the recursion number k of EG increases, the errors of (c) and (d) are 6.3 mm and 11.2 mm lower than that of (b), respectively. Meanwhile, as illustrated in Fig. 6, since our repetition in depth (d) can continuously model high-frequency components, the intermediate depth features possess more detailed boundaries and the corresponding image guidance branch consistently has a high response near those regions. These facts strongly demonstrate the effectiveness of our repetitive guidance design.

(iii) Adaptive fusion. Based on (d), which directly outputs the feature of RG-EG\(_3\), we choose to utilize all features of RG-EG\(_k\) (\(k=1,2,3\)) to produce better depth representations. (e), (f), and (g) refer to the addition, concatenation, and our AF strategies, respectively. Specifically, in (f) we conduct a \(3 \times 3\) convolution after concatenation to keep the channel number the same as RG-EG\(_3\)'s. As we can see from the 'AF' column of Table 6, all three strategies improve the performance of the model with a small GPU memory sacrifice (about 0-0.06 GB), which demonstrates that aggregating multi-step features in the repetitive procedure is effective. Furthermore, our AF mechanism obtains the best result among them, outperforming (d) by 5.3 mm. These facts prove that our AF design benefits the system more than simple fusion strategies. Detailed differences between the intermediate features produced by our repetitive design are shown in Figs. 2 and 6.

Fig. 7.

Comparisons under different levels of sparsity on the KITTI validation split. The solid lines refer to our method while the dotted ones represent other approaches.

Fig. 8.

Comparisons with existing methods (left) and with itself replacing 'RG' with '+' (right), under different lighting and weather conditions on the Virtual KITTI test split.

5.6 Generalization Capabilities

In this subsection, we further validate the generalization capability of our RigNet under different sparsity settings, including the number of valid points, various lighting and weather conditions, and the synthetic pattern of sparse data. The corresponding results are illustrated in Figs. 7 and 8.

(1) Number of Valid Points

On the KITTI selected validation split, we compare our method with four well-known approaches with available code, i.e., S2D [33], Fusion [49], NConv [10], and ACMNet [59]. Note that all models are pretrained on the KITTI training split with the raw sparsity, which is equivalent to a sampling ratio of 1.0, and are not fine-tuned on the generated depth inputs. Specifically, we first uniformly sample the raw depth maps with ratios (0.025, 0.05, 0.1, 0.2) and (0.4, 0.6, 0.8, 1.0) to produce the sparse depth inputs. Then we test the pretrained models on these inputs. Figure 7 shows that our RigNet significantly outperforms the others under all levels of sparsity in terms of both the RMSE and MAE metrics. These results indicate that our method can deal well with complex data inputs.
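The uniform sub-sampling of the raw depth maps could be sketched as below; keeping each valid LiDAR point with probability equal to the ratio (a Bernoulli mask) is an assumption about the exact sampling scheme.

```python
import torch

def subsample_sparse_depth(sparse_depth, ratio):
    """Keep each valid LiDAR point with probability `ratio` (uniform sampling sketch)."""
    valid = sparse_depth > 0
    keep = torch.rand_like(sparse_depth) < ratio
    return sparse_depth * (valid & keep).float()
```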

(2) Lighting and Weather Condition

The lighting condition of the KITTI dataset is almost invariable and the weather condition is good. However, both lighting and weather conditions are vitally important for depth completion, especially for self-driving applications. Therefore, we fine-tune our RigNet (trained on KITTI) on the 'clone' split of Virtual KITTI [11] and test under all other lighting and weather conditions. As shown in the right of Fig. 8, where we compare 'RG' with '+' (replacing RG with addition), our method outperforms '+' by a large margin on RMSE. The left of Fig. 8 further demonstrates that RigNet performs better than GuideNet [47] and ACMNet [59] in complex environments. These results verify that our method is able to handle diverse lighting and weather conditions.

In summary, all the evidence above demonstrates that the proposed approach has robust generalization capabilities.

6 Conclusion

In this paper, we explored a repetitive design in our image guided network for the depth completion task. We pointed out two issues impeding the performance of existing outstanding methods, i.e., blurry guidance in the image branch and unclear structures in the depth branch. To tackle the former, in our image guidance branch we presented a repetitive hourglass network to produce discriminative image features. To alleviate the latter, in our depth generation branch we designed a repetitive guidance module to gradually predict structure-detailed depth maps. Meanwhile, to model high-frequency components and reduce the GPU memory consumption of the module, we proposed an efficient guidance algorithm. Furthermore, we designed an adaptive fusion mechanism to automatically fuse multi-stage depth features for better predictions. Extensive experiments show that our method achieves outstanding performance.