
1 Introduction

Depth completion, the technique of converting sparse depth measurements into dense ones, has a variety of applications in the computer vision field, such as autonomous driving [7, 14, 50], augmented reality [8, 45], virtual reality [1], and 3D scene reconstruction [36, 42, 43, 57]. The success of these applications heavily depends on reliable depth predictions. Recently, multi-modal information from various sensors has been incorporated to help generate dependable depth results, such as color images [3, 33], surface normals [38, 57], confidence maps [10, 49], and even binaural echoes [12, 35]. In particular, the latest image guided methods [17, 29, 47, 59] principally concentrate on using color images to guide the recovery of dense depth maps, achieving outstanding performance. However, due to challenging environments and limited depth measurements, it is difficult for existing image guided methods to produce clear image guidance and structure-detailed depth features (see Figs. 2 and 6). To deal with these issues, in this paper we develop a repetitive design in both the image guidance branch and the depth generation branch.

Fig. 1.

To obtain dense depth predictions, most existing image guided methods employ tandem models [4, 33, 36] (a) or parallel models [17, 29, 47, 59] (b, c) with various inputs (e.g., boundary/confidence/normal/RGB-D), whereas we propose the repetitive mechanism (d), aiming to provide gradually refined image/depth guidance.

In the image guidance branch: Existing image guided methods do not produce sufficiently precise details to provide perspicuous image guidance, which limits content-complete depth recovery. For example, the tandem models (Fig. 1(a)) tend to utilize only the final layer features of an hourglass unit. The parallel models conduct scarce interaction between multiple hourglass units (Fig. 1(b)), or refer to image guidance encoded by only a single hourglass unit (Fig. 1(c)). Different from them, as shown in Fig. 1(d), we present a vertically repetitive hourglass network to make good use of RGB features in multi-scale layers, which contain image semantics with much clearer and richer contexts.

In the depth generation branch: It is known that gradients near boundaries usually exhibit large mutations, which increase the difficulty of recovering structure-detailed depth for convolution [48]. As evidenced in plenty of methods [10, 18, 36], depth values are usually hard to predict, especially around regions with unclear boundaries. To mitigate this issue, in this paper we propose a repetitive guidance module based on dynamic convolution [47]. It first extracts the high-frequency components by channel-wise and cross-channel convolution factorization, and then repeatedly stacks the guidance unit to progressively produce refined depth. We also design an adaptive fusion mechanism to obtain better depth representations by aggregating the depth features of each repetitive unit. However, an obvious drawback of dynamic convolution is its large GPU memory consumption, especially in the case of our repetitive structure. Hence, we further introduce an efficient module that largely reduces the memory cost while maintaining accuracy.

Benefiting from the repetitive strategy with gradually refined image/depth representations, our method performs better than others, as shown in Figs. 4, 5 and 6, and reported in Tables 3, 4, 5 and 6. In short, our contributions are:

  • We propose an effective but lightweight repetitive hourglass network, which can extract legible image features of challenging environments to provide clearer guidance for depth recovery.

  • We present the repetitive guidance module based on dynamic convolution, including an adaptive fusion mechanism and an efficient guidance algorithm, which can gradually learn precise depth representations.

  • Extensive experimental results demonstrate the effectiveness of our method, which achieves outstanding performance on three datasets.

2 Related Work

Depth-only Approaches. In 2017, the work [48] first proposed sparsity invariant CNNs to deal with sparse depth. Since then, many depth completion works [6, 10, 22, 24, 33, 48, 49] have taken depth as input without using color images. Distinctively, Lu et al. [32] take sparse depth as the only input, with color images serving as auxiliary supervision during training. However, single-modal methods are limited without other reference information. As technology develops, plenty of multi-modal information becomes available, e.g., surface normals and optical flow images, which can significantly facilitate the depth completion task.

Image Guided Methods. Existing image guided depth completion methods can be roughly divided into two patterns. In one pattern, various maps are fed together into tandem hourglass networks [3,4,5, 33, 36, 52]. For example, S2D [33] directly feeds the concatenation into a simple Unet [41]. CSPN [5] studies the affinity matrix to refine coarse depth maps with a spatial propagation network (SPN). CSPN++ [4] further improves its effectiveness and efficiency by learning adaptive convolutional kernel sizes and the number of iterations for propagation. As an extension, NLSPN [36] presents a non-local SPN which focuses on relevant non-local neighbors during propagation. The other pattern uses multiple independent branches to model different sensor information and then fuses them at multi-scale stages [17, 26, 29, 47, 49, 53]. For example, PENet [17] employs feature addition to guide depth learning at different stages. ACMNet [59] chooses graph propagation to capture the observed spatial contexts. GuideNet [47] seeks to predict dynamic kernel weights from the guided image and then adaptively extract depth features. However, these methods still cannot provide sufficiently rich semantic guidance for the specific depth completion task.

Repetitive Learning Models. To extract more accurate and abundant feature representations, many approaches [2, 31, 37, 40] repeatedly stack similar components. For example, PANet [30] adds an extra bottom-up path aggregation which is similar to its preceding top-down feature pyramid network (FPN). NAS-FPN [13] and BiFPN [46] conduct repetitive blocks to sufficiently encode discriminative image semantics for object detection. FCFRNet [29] argues that the feature extraction in one-stage frameworks is insufficient, and thus proposes a two-stage model, which can be regarded as a special case of the repetitive design. On this basis, PENet [17] further improves performance by utilizing confidence maps and a varietal CSPN++. Different from these methods, in our image branch we first conduct repetitive CNN units to produce clearer guidance in multi-scale layers. Then in our depth branch we perform a repetitive guidance module to generate structure-detailed depth.

Fig. 2.

Overview of our repetitive image guided network, which contains an image guidance branch and a depth generation branch. The former consists of a repetitive hourglass network (RHN) and the latter has a similar structure to RHN\(_1\). In the depth branch, we perform our novel repetitive guidance module (RG, elaborated in Fig. 3) to refine depth. In addition, an efficient guidance algorithm (EG) and an adaptive fusion mechanism (AF) are proposed to further improve the performance of the module.

3 Repetitive Design

In this section, we first introduce our repetitive hourglass network (RHN), and then elaborate on the proposed repetitive guidance module (RG), including an efficient guidance algorithm (EG) and an adaptive fusion mechanism (AF).

3.1 Repetitive Hourglass Network

For autonomous driving in challenging environments, it is important to understand the semantics of color images given the sparse depth measurements. The problem of blurry image guidance can be mitigated by a powerful feature extractor that obtains context-clear semantics. In this paper we present our repetitive hourglass network, shown in Fig. 2. RHN\(_i\) is a symmetrical hourglass unit like Unet. The original color image is first encoded by a \(5\times 5\) convolution and then input into RHN\(_1\). Next, we repeatedly apply a similar but more lightweight unit, each layer of which consists of two convolutions, to gradually extract high-level semantics. In the encoder of RHN\(_i\), \(E_{ij}\) takes \(E_{i(j-1)}\) and \(D_{(i-1)j}\) as input. In the decoder of RHN\(_i\), \(D_{ij}\) takes \(E_{ij}\) and \(D_{i(j+1)}\) as input. When \(i>1\), the process is

$$\begin{aligned} E_{ij}&=\begin{cases} Conv\left( D_{(i-1)j} \right), & j=1, \\ Conv\left( E_{i(j-1)} \right) + D_{(i-1)j}, & 1<j\le 5, \end{cases} \\ D_{ij}&=\begin{cases} Conv\left( E_{i5} \right), & j=5, \\ Deconv\left( D_{i(j+1)} \right) + E_{ij}, & 1\le j<5, \end{cases} \end{aligned}$$
(1)

where \({Deconv}\left( \cdot \right) \) denotes the deconvolution function, and \(E_{1j}=Conv(E_{1(j-1)})\).
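To make the recurrence in Eq. (1) concrete, a minimal PyTorch sketch of one lightweight RHN\(_i\) (\(i>1\)) is given below. The class name `LightweightRHN`, the constant channel width, the stride pattern, and the activation placement are assumptions for illustration; this is not the authors' implementation, and RHN\(_1\) (ResNet-based) is omitted.

```python
import torch
import torch.nn as nn

class LightweightRHN(nn.Module):
    """Sketch of one lightweight RHN_i (i > 1) following Eq. (1).

    Assumes 5 encoder/decoder scales, a constant channel width, and that the
    previous unit's decoder features D_{(i-1)j} are passed in as a list
    ordered from full resolution (j = 1) to coarsest (j = 5).
    """

    def __init__(self, channels=64, num_scales=5):
        super().__init__()
        # each encoder layer: two convolutions, downsampling by stride 2 (except the first)
        self.enc = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, stride=1 if j == 0 else 2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True))
            for j in range(num_scales)])
        # bottleneck convolution for D_{i5} = Conv(E_{i5})
        self.mid = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        # decoder layers: deconvolution upsampling by stride 2
        self.dec = nn.ModuleList([
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
            for _ in range(num_scales - 1)])

    def forward(self, prev_dec):          # prev_dec[j] = D_{(i-1)(j+1)}, j = 0..4
        E = []
        for j in range(5):                # encoder, Eq. (1) top
            if j == 0:
                E.append(self.enc[j](prev_dec[j]))
            else:
                E.append(self.enc[j](E[j - 1]) + prev_dec[j])
        D = [None] * 5
        D[4] = self.mid(E[4])             # decoder, Eq. (1) bottom (coarsest scale)
        for j in range(3, -1, -1):
            D[j] = self.dec[j](D[j + 1]) + E[j]
        return D                          # passed on to RHN_{i+1} / the depth branch
```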

Fig. 3.

Our repetitive guidance module (RG), implemented with an efficient guidance algorithm (EG) and an adaptive fusion mechanism (AF). k refers to the repetition number.

3.2 Repetitive Guidance Module

Depth in challenging environments is not only extremely sparse but also diverse. Most existing methods suffer from unclear structures, especially near object boundaries. Since gradual refinement has been proven effective [4, 36, 52] for tackling this issue, we propose our repetitive guidance module to progressively generate dense and structure-detailed depth maps. As illustrated in Fig. 2, our depth generation branch has the same architecture as RHN\(_1\). Given the sparse depth input and the color image guidance features \(D_{ij}\) in the decoder of the last RHN, our depth branch generates the final dense predictions. In the encoder of the depth branch, our repetitive guidance module (left of Fig. 3) takes \(D_{ij}\) and \(e_{1j}\) as input and employs the efficient guidance algorithm (Sect. 3.2) to produce refined depth \(d_{jk}\) step by step. Then we fuse the refined \(d_{jk}\) by our adaptive fusion mechanism (Sect. 3.2), obtaining the depth \(d_j\),

$$\begin{aligned} {{d}_{j}}=RG\left( {{D}_{ij}},{{e}_{1j}} \right) , \end{aligned}$$
(2)

where \({RG}\left( \cdot \right) \) refers to the repetitive guidance function.

Efficient Guidance Algorithm. Suppose the sizes of the inputs \(D_{ij}\) and \(e_{1j}\) are both \(C\times H\times W\). The complexity of the dynamic convolution, which generates spatially-variant kernels according to color image features, is \(O(C\times C\times {{R}^{2}}\times H\times W)\), where \(R^2\) is the size of the filter kernel window. In practice, C, H, and W are usually very large; it is thus necessary to reduce the complexity of the dynamic convolution. GuideNet [47] proposes channel-wise and cross-channel convolution factorization, whose complexity is \(O(C\times {R^2}\times H\times W + C\times C)\). However, our repetitive guidance module employs the convolution factorization many times, and the channel-wise step still requires massive GPU memory, namely \(O(C\times {R^2}\times H\times W)\). Therefore, inspired by SENet [16], which captures high-frequency responses with channel-wise differentiable operations, we design an efficient guidance unit that simultaneously reduces the complexity of the channel-wise convolution and encodes high-frequency components, as shown in the top right of Fig. 3. Specifically, we first concatenate the image and depth inputs and then conduct a \(3\times 3\) convolution. Next, we employ the global average pooling function to generate a \(C\times 1\times 1\) feature. At last, we perform a pixel-wise dot product between this feature and the depth input. The complexity of our channel-wise convolution is only \(O(C\times H \times W)\), reduced by a factor of \(R^{2}\). The process is defined as

$$\begin{aligned} {{d}_{jk}}=\begin{cases} EG\left( {{D}_{ij}},{{e}_{1j}} \right), & k=1, \\ EG\left( Conv\left( {{D}_{ij}} \right) ,{{d}_{j(k-1)}} \right), & k>1, \end{cases} \end{aligned}$$
(3)

where \({EG}\left( \cdot \right) \) represents the efficient guidance function.
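A minimal sketch of the EG unit and the repetition in Eq. (3) could look like the following. The channel widths, the repetition number k, and the placement of normalization/activation layers are assumptions, and the adaptive fusion step is deferred to the next subsection; this is an illustrative sketch rather than the authors' code.

```python
import torch
import torch.nn as nn

class EfficientGuidance(nn.Module):
    """EG unit: concat -> 3x3 conv -> global average pooling -> channel-wise dot."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)           # produces a C x 1 x 1 response

    def forward(self, img_feat, depth_feat):
        w = self.gap(self.conv(torch.cat([img_feat, depth_feat], dim=1)))
        return w * depth_feat                         # pixel-wise (channel-wise) dot product


class RepetitiveGuidance(nn.Module):
    """RG module following Eq. (3): apply EG k times (AF fusion omitted here)."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.eg = nn.ModuleList([EfficientGuidance(channels) for _ in range(k)])
        self.img_conv = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(k - 1)])

    def forward(self, D_ij, e_1j):
        outs = [self.eg[0](D_ij, e_1j)]               # k = 1
        for n in range(1, self.k):                    # k > 1
            outs.append(self.eg[n](self.img_conv[n - 1](D_ij), outs[-1]))
        return outs                                   # d_{j1}, ..., d_{jk} for the AF step
```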

Suppose the memory consumptions of the common dynamic convolution, convolution factorization, and our EG are \(M_{DC}\), \(M_{CF}\), and \(M_{EG}\), respectively.

Table 1. Theoretical analysis on GPU memory consumption.
Table 2. Numerical analysis on GPU memory consumption.

Table 1 shows the theoretical analysis of the GPU memory consumption ratios. Under the setting of the second fusion stage (out of 4 in total) in our depth generation branch, using 4-byte floating precision and taking \(C=128\), \(H=128\), \(W=608\), and \(R=3\), as shown in Table 2, the GPU memory of EG is reduced from 42.75 GB to 0.037 GB compared with the common dynamic convolution, nearly 1155 times lower in one fusion stage. Compared to the convolution factorization in GuideNet [47], the memory of EG is reduced from 0.334 GB to 0.037 GB, nearly 9 times lower. Therefore, we can apply our repetitive strategy easily without worrying much about GPU memory consumption.
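The numbers in Table 2 can be reproduced with a back-of-the-envelope calculation over the complexity terms above (4-byte floats, one fusion stage, batch size 1 assumed); this is only a sanity check, not part of the model.

```python
# Memory of the spatially-variant kernels under 4-byte floating precision.
C, H, W, R = 128, 128, 608, 3
bytes_per_float = 4
GB = 1024 ** 3

M_DC = C * C * R**2 * H * W * bytes_per_float / GB       # common dynamic convolution
M_CF = (C * R**2 * H * W + C * C) * bytes_per_float / GB  # convolution factorization [47]
M_EG = C * H * W * bytes_per_float / GB                   # our efficient guidance

print(f"M_DC = {M_DC:.2f} GB")   # ~42.75 GB
print(f"M_CF = {M_CF:.3f} GB")   # ~0.334 GB
print(f"M_EG = {M_EG:.3f} GB")   # ~0.037 GB
# ratio M_DC / M_EG = C * R^2 = 1152 (reported as nearly 1155x using the rounded GB values)
print(f"ratios: {M_DC / M_EG:.0f}x, {M_CF / M_EG:.1f}x")
```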

Adaptive Fusion Mechanism. Since many coarse depth features (\(d_{j1}\), \(\cdots \), \(d_{jk}\)) are available in our repetitive guidance module, it is natural to jointly utilize them to generate refined depth maps, which has been proven effective in various related methods [4, 17, 28, 36, 45, 58]. Inspired by the selective kernel convolution in SKNet [27], we propose the adaptive fusion mechanism to refine depth, illustrated in the bottom right of Fig. 3. Specifically, given inputs \((d_{j1}, \cdots , d_{jk})\), we first concatenate them and then perform a \(3\times 3\) convolution. Next, global average pooling is employed to produce a \(C\times 1\times 1\) feature map. Then another \(3\times 3\) convolution and a softmax function are applied, obtaining \((\alpha _{1},\cdots ,\alpha _{k})\),

$$\begin{aligned} {{\alpha }_{k}}=Soft\left( Conv\left( GAP\left( Conv\left( {{d}_{j1}}|| \cdots ||{{d}_{jk}} \right) \right) \right) \right) , \end{aligned}$$
(4)

where \(Soft\left( \cdot \right) \) and || refer to softmax function and concatenation. \(GAP\left( \cdot \right) \) represents the global average pooling operation. Finally, we fuse the k coarse depth maps using \(\alpha _{k}\) to produce the output \(d_j\),

$$\begin{aligned} {{d}_{j}}=\sum \nolimits _{n=1}^{k}{{{\alpha }_{n}}{{d}_{jn}}}. \end{aligned}$$
(5)

Equations 4 and 5 can be jointly denoted as

$$\begin{aligned} {{d}_{j}}=AF\left( {d}_{j1},{d}_{j2},\cdots ,{d}_{jk} \right) , \end{aligned}$$
(6)

where \(AF\left( \cdot \right) \) represents the adaptive fusion function.
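Under the same assumptions as the earlier sketches, the AF mechanism of Eqs. (4)-(6) could be written as below. Taking the softmax over the k branches per channel (SKNet-style) is one plausible reading of Eq. (4) and is an assumption, as are the channel sizes.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """AF mechanism, Eqs. (4)-(6): learn softmax weights over the k refined depths."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.conv1 = nn.Conv2d(k * channels, channels, 3, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv2 = nn.Conv2d(channels, k * channels, 3, padding=1)

    def forward(self, d_list):                        # [d_j1, ..., d_jk], each B x C x H x W
        B, C, _, _ = d_list[0].shape
        s = self.conv2(self.gap(self.conv1(torch.cat(d_list, dim=1))))
        alpha = torch.softmax(s.view(B, self.k, C, 1, 1), dim=1)    # Eq. (4)
        d_stack = torch.stack(d_list, dim=1)                         # B x k x C x H x W
        return (alpha * d_stack).sum(dim=1)                          # Eq. (5), output d_j
```

In practice this module would consume the list returned by the `RepetitiveGuidance` sketch above, which matches the "about 0-0.06 GB" extra memory reported for AF in the ablations since only the small weighting branch is added.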

4 RigNet

In this section, we describe the network architecture and the loss function for training. The proposed RigNet mainly consists of two parts: (1) an image guidance branch for the generation of hierarchical and clear semantics based on the repetitive hourglass network, and (2) a depth generation branch for structure-detailed depth predictions based on the novel repetitive guidance module with an efficient guidance algorithm and an adaptive fusion mechanism.

4.1 Network Architecture

Figure 2 shows the overview of our network. In the image guidance branch, the RHN\(_1\) encoder-decoder unit is built upon residual networks [15]. In addition, we adopt the common skip-connection strategy [3, 41] to simultaneously utilize low-level and high-level features. RHN\(_i\) (\(i>1\)) has a similar but more lightweight architecture than RHN\(_1\), which is used to extract clearer image guidance semantics [54].

The depth generation branch has the same structure as RHN\(_1\). In this branch, we apply the repetitive guidance module based on dynamic convolution to gradually produce structure-detailed depth features at multiple stages, as shown in Fig. 3 and described in Sect. 3.2.

4.2 Loss Function

During training, we adopt the mean squared error (MSE) to compute the loss, which is defined as

$$\begin{aligned} \mathcal {L}=\frac{1}{m}\sum \limits _{q\in {{Q}_{v}}}{\left\| GT_{q}-{P}_{q} \right\| }^{2}, \end{aligned}$$
(7)

where GT and P refer to the ground truth depth and the predicted depth, respectively. \(Q_v\) represents the set of valid pixels in GT, and m is the number of valid pixels.
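Equation (7) is an MSE evaluated only on valid ground-truth pixels. A minimal sketch, assuming invalid pixels are encoded as zero depth (as in the KITTI annotations):

```python
import torch

def masked_mse_loss(pred, gt):
    """MSE over valid ground-truth pixels only (Eq. 7).

    Assumes invalid pixels in `gt` are marked by a depth value of 0.
    """
    valid = gt > 0
    return ((gt[valid] - pred[valid]) ** 2).mean()
```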

5 Experiments

In this section, we first introduce the related datasets, metrics, and implementation details. Then, we carry out extensive experiments to evaluate the performance of our method against other state-of-the-art approaches. Finally, a number of ablation studies are employed to verify the effectiveness of our method.

5.1 Datasets and Metrics

KITTI Depth Completion Dataset. [48] is a large real-world autonomous driving benchmark collected from a driving vehicle. It consists of 86,898 frames with ground truth annotations, aligned sparse LiDAR maps, and color images for training, 7,000 frames for validation, and another 1,000 frames for testing. The official 1,000 validation images are used for validation during training, while the remaining validation images are ignored. Since there are few LiDAR points at the top of the depth maps, the input images are bottom-center cropped [29, 47, 49, 59] to \(1216\times 256\).

Virtual KITTI Dataset. [11] is a synthetic dataset cloned from the real-world KITTI video sequences. In addition, it provides color images under various lighting (e.g., sunset, morning) and weather (e.g., rain, fog) conditions. Following GuideNet [47], we use the masks generated from the sparse depths of the KITTI dataset to obtain sparse samples. This strategy keeps the distribution of sparse depths close to the real-world situation. Sequences 0001, 0002, 0006, and 0018 are used for training, and sequence 0020 with various lighting and weather conditions is used for testing. This yields 1,289 frames for fine-tuning and 837 frames for evaluating each condition.

NYUv2 Dataset. [44] comprises video sequences of a variety of indoor scenes recorded by the color and depth cameras of the Microsoft Kinect. Paired color images and depth maps from 464 indoor scenes are commonly used. Following previous depth completion methods [3, 33, 36, 38, 47], we train our model on 50K images from the official training split and test on the 654 images from the official labeled test set. Each image is downsized to \(320\times 240\), and then \(304\times 228\) center-cropping is applied. As the input resolution of our network must be a multiple of 32, we further pad the images to \(320\times 256\), but evaluate only on the valid region of size \(304\times 228\) to keep the comparison with other methods fair.
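The resize-crop-pad pipeline above could be sketched as follows for the color images. The function name `preprocess_nyu`, the bilinear interpolation mode, and the symmetric zero padding are assumptions; sparse depth maps would typically need a validity-aware (e.g., nearest-neighbor) resize instead.

```python
import torch
import torch.nn.functional as F

def preprocess_nyu(img):                   # img: B x C x 480 x 640 color tensor
    # downsize to 320 x 240
    img = F.interpolate(img, size=(240, 320), mode='bilinear', align_corners=False)
    # center crop to 304 x 228
    top, left = (240 - 228) // 2, (320 - 304) // 2
    img = img[:, :, top:top + 228, left:left + 304]
    # pad back to 320 x 256 (multiple of 32); evaluation uses only the valid 304 x 228 region
    return F.pad(img, ((320 - 304) // 2, (320 - 304) // 2,    # left, right
                       (256 - 228) // 2, (256 - 228) // 2))   # top, bottom
```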

Metrics. For the outdoor KITTI depth completion dataset, following the KITTI benchmark and existing methods [17, 29, 36, 47], we use four standard metrics for evaluation, including RMSE, MAE, iRMSE, and iMAE. For the indoor NYUv2 dataset, following previous works [3, 29, 36, 38, 47], three metrics are selected for evaluation, including RMSE, REL, and \({{\delta }_{i}}\) (\(i=1.25, 1.25^2, 1.25^3\)).

5.2 Implementation Details

The model is trained on 4 TITAN RTX GPUs. We train it for 20 epochs with the loss defined in Eq. 7. We use ADAM [23] as the optimizer with momentum parameters \(\beta _{1}=0.9\) and \(\beta _{2}=0.999\), a starting learning rate of \(1 \times {10}^{-3}\), and a weight decay of \(1 \times {10}^{-6}\). The learning rate drops by half every 5 epochs. Synchronized cross-GPU batch normalization [21, 55] is used during training.
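These settings map onto standard PyTorch components as in the sketch below; `model` and `train_loader` are hypothetical placeholders, `masked_mse_loss` refers to the sketch in Sect. 4.2, and the multi-GPU/sync-BN wrapping is omitted.

```python
import torch

def train(model, train_loader, epochs=20):
    """Training loop sketch matching the settings above (hypothetical model/loader)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=1e-6)
    # halve the learning rate every 5 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
    for _ in range(epochs):
        for sparse_depth, rgb, gt in train_loader:
            optimizer.zero_grad()
            pred = model(sparse_depth, rgb)
            loss = masked_mse_loss(pred, gt)   # Eq. (7)
            loss.backward()
            optimizer.step()
        scheduler.step()
```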

Table 3. Quantitative comparisons on KITTI depth completion benchmark.
Table 4. Quantitative comparisons on NYUv2 dataset.

5.3 Evaluation on KITTI Dataset

Table 3 shows the quantitative results on the KITTI benchmark, whose dominant evaluation metric is RMSE. Our RigNet ranks 1st among published papers at the time of submission, outperforming the 2nd-best method by a significant 17.42 mm while the errors of the other methods are very close to one another. The performance of our RigNet is also better than approaches that employ additional datasets, e.g., DLiDAR [38], which utilizes CARLA [9] to predict surface normals for better depth predictions. Qualitative comparisons with several state-of-the-art works are shown in Fig. 4. While all methods provide visually good results in general, our estimated depth maps possess more details and more accurate object boundaries. The corresponding error maps offer clearer support. For example, among the marked iron pillars in the first row of Fig. 4, the error of our prediction is significantly lower than that of the others.

5.4 Evaluation on NYUv2 Dataset

To verify the performance of the proposed method on indoor scenes, following existing approaches [4, 29, 36, 47], we train our repetitive image guided network on the NYUv2 dataset [44] with 500 sparse samples. As illustrated in Table 4, our model achieves the best performance among all traditional and recent approaches without using additional datasets, which shows that our network possesses stronger generalization capability. Figure 5 presents the qualitative results. Compared with those state-of-the-art methods, our RigNet recovers more detailed structures with lower errors at most pixels, including sharper boundaries and more complete object shapes. For example, among the marked doors in the last row of Fig. 5, our prediction is very close to the ground truth, while the others either have large errors over whole regions or blurry shapes on specific objects.

Fig. 4.

Qualitative results on the KITTI depth completion test set, including (b) GuideNet [47], (c) FCFRNet [29], and (d) CSPN [5]. Given sparse depth maps and the aligned color images (1st column), depth completion models output dense depth predictions (e.g., 2nd column). We provide error maps borrowed from the KITTI leaderboard for detailed discrimination. Warmer colors in the error maps refer to higher errors.

Fig. 5.

Qualitative results on the NYUv2 test set. From left to right: (a) color image, (b) sparse depth, (c) NLSPN [36], (d) ACMNet [59], (e) CSPN [5], (f) our RigNet, and (g) ground truth. We present the results of these four methods under 500 samples. The circled rectangular areas show the recovery of object details.

Table 5. Ablation studies of RHN on the KITTI validation set. The baseline uses a single ResNet-18 as backbone. 'Deeper'/'More' denotes that we conduct single & deeper / multiple & tandem hourglass units as backbone. Note that each layer of RHN\(_{2,3}\) only contains two convolutions while RHN\(_{1}\) employs ResNet.

5.5 Ablation Studies

Here we conduct extensive experiments to verify the effectiveness of each proposed component, including the repetitive hourglass network (RHN, Table 5) and the repetitive guidance module (RG, Table 6), which consists of the efficient guidance algorithm (EG) and the adaptive fusion mechanism (AF). Note that the batch size is set to 8 when computing GPU memory consumption.

(1) Effect of Repetitive Hourglass Network

The state-of-the-art baseline GuideNet [47] employs a single ResNet-18 as backbone and the guided convolution G\(_1\) to predict dense depth. To validate the effect of our RHN, we explore the backbone design of the image guidance branch for the specific depth completion task from four aspects, as illustrated in Table 5.

(i) Deeper single backbone vs. RHN. The second column of Table 5 shows that, when replacing the single ResNet-10 with ResNet-18, the error is reduced by 43 mm. However, when deepening the baseline from 18 to 26/34/50, the errors barely change, which indicates that simply increasing the network depth of the image guidance branch cannot deal well with the specific depth completion task. In contrast, with little sacrifice in parameters (\(\sim \)2 M), our RHN-10-3 and RHN-18-3 are 24 mm and 10 mm superior to Deeper-10-1 and Deeper-18-1, respectively. Figure 6 shows that the image features of our parallel RHN-18-3 have much clearer and richer contexts than those of the baseline Deeper-18-1.

(ii) More tandem backbones vs. RHN. As shown in the third column of Table 5, we stack the hourglass unit in series. The models More-18-2, More-18-3, and More-18-4 perform worse than the baseline Deeper-18-1. It turns out that the combination of tandem hourglass units is not sufficient to provide clearer image semantic guidance for depth recovery. In contrast, our parallel RHN achieves better results with fewer parameters and smaller model sizes. These facts give strong evidence that the parallel repetitive design in the image guidance branch is effective for the depth completion task.

(iii) Deeper-More backbones vs. RHN. As illustrated in the fourth column of Table 5, deeper hourglass units are deployed in a serial way. We can see that the Deeper-More combinations are also not very effective, since their errors are higher than the baseline's while RHN's error is 10 mm lower. This again verifies the effectiveness of the lightweight RHN design.

(2) Effect of Repetitive Guidance Module

(i) Efficient guidance. Note that we directly output the features of EG\(_{3}\) when not employing AF. Tables 1 and 2 have provided quantitative theoretical analysis for the EG design. Based on (a), we disable G\(_1\) by replacing it with EG\(_{1}\). Comparing (b) with (a) in Table 6, both of which carry out the guided convolution technology only once, although the error of (b) goes down only a little bit, the GPU memory is heavily reduced by 11.95 GB. These results give strong evidence that our new guidance design is not only effective but also efficient.

Table 6. Ablation studies of RG/AF on the KITTI validation set. RG-EG\(_k\) refers to the case where we repeatedly use EG k times. '\(\pm 0\)' refers to 23.37 GB. G\(_1\) represents the raw guided convolution in GuideNet [47], which is used only once in one fusion stage.
Fig. 6.

Visual comparisons of the intermediate features of the baseline and our repetitive design.

(ii) Repetitive guidance. When the recursion number k of EG increases, the errors of (c) and (d) are 6.3 mm and 11.2 mm lower than that of (b), respectively. Meanwhile, as illustrated in Fig. 6, since our repetition in depth (d) can continuously model high-frequency components, the intermediate depth features possess more detailed boundaries and the corresponding image guidance branch consistently has a high response near those regions. These facts strongly demonstrate the effectiveness of our repetitive guidance design.

(iii) Adaptive fusion. Based on (d), which directly outputs the feature of RG-EG\(_3\), we choose to utilize all features of RG-EG\(_k\) (\(k=1,2,3\)) to produce better depth representations. (e), (f), and (g) refer to the addition, concatenation, and our AF strategies, respectively. Specifically, in (f) we conduct a \(3 \times 3\) convolution after concatenation to keep the channel number the same as RG-EG\(_3\)'s. As we can see from the 'AF' column of Table 6, all three strategies improve the performance of the model with a small GPU memory sacrifice (about 0-0.06 GB), which demonstrates that aggregating multi-step features in the repetitive procedure is effective. Furthermore, our AF mechanism obtains the best result among them, outperforming (d) by 5.3 mm. These facts prove that our AF design benefits the system more than simple fusion strategies. Detailed differences between the intermediate features produced by our repetitive design are shown in Figs. 2 and 6.

Fig. 7.

Comparisons under different levels of sparsity on the KITTI validation split. The solid lines refer to our method while the dotted ones represent other approaches.

Fig. 8.

Comparisons with existing methods (left) and with itself replacing 'RG' with '+' (right), under different lighting and weather conditions on the Virtual KITTI test split.

5.6 Generalization Capabilities

In this subsection, we further validate the generalization capability of our RigNet under different sparsity settings, including the number of valid points, various lighting and weather conditions, and the synthetic pattern of sparse data. The corresponding results are illustrated in Figs. 7 and 8.

(1) Number of Valid Points

On the KITTI selected validation split, we compare our method with four well-known approaches with available code, i.e., S2D [33], Fusion [49], NConv [10], and ACMNet [59]. Note that all models are pretrained on the KITTI training split with the raw sparsity, which is equivalent to a sampling ratio of 1.0, and are not fine-tuned on the generated depth inputs. Specifically, we first uniformly sample the raw depth maps with ratios (0.025, 0.05, 0.1, 0.2) and (0.4, 0.6, 0.8, 1.0) to produce the sparse depth inputs. Then we test the pretrained models on these inputs. Figure 7 shows that our RigNet significantly outperforms the others under all levels of sparsity in terms of both the RMSE and MAE metrics. These results indicate that our method can deal well with complex data inputs.
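The uniform sub-sampling of the raw depth maps could be sketched as below; keeping each valid LiDAR point with probability equal to the ratio (a Bernoulli mask) is an assumption about the exact sampling scheme.

```python
import torch

def subsample_sparse_depth(sparse_depth, ratio):
    """Keep each valid LiDAR point with probability `ratio` (uniform sampling sketch)."""
    valid = sparse_depth > 0
    keep = torch.rand_like(sparse_depth) < ratio
    return sparse_depth * (valid & keep).float()
```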

(2) Lighting and Weather Condition

The lighting condition of the KITTI dataset is almost invariable and the weather condition is good. However, both lighting and weather conditions are vitally important for depth completion, especially for self-driving applications. Therefore, we fine-tune our RigNet (trained on KITTI) on the 'clone' split of Virtual KITTI [11] and test under all other lighting and weather conditions. As shown in the right of Fig. 8, where we compare 'RG' with '+' (replacing RG with addition), our method outperforms '+' by a large margin on RMSE. The left of Fig. 8 further demonstrates that RigNet performs better than GuideNet [47] and ACMNet [59] in complex environments. These results verify that our method is able to handle diverse lighting and weather conditions.

In summary, all the evidence above demonstrates that the proposed approach has robust generalization capabilities.

6 Conclusion

In this paper, we explored a repetitive design in our image guided network for the depth completion task. We pointed out two issues impeding the performance of existing outstanding methods, i.e., blurry guidance in the image branch and unclear structures in the depth branch. To tackle the former, in our image guidance branch we presented a repetitive hourglass network to produce discriminative image features. To alleviate the latter, in our depth generation branch we designed a repetitive guidance module to gradually predict structure-detailed depth maps. Meanwhile, to model high-frequency components and reduce the GPU memory consumption of the module, we proposed an efficient guidance algorithm. Furthermore, we designed an adaptive fusion mechanism to automatically fuse multi-stage depth features for better predictions. Extensive experiments show that our method achieves outstanding performance.