
1 Introduction

Monocular depth estimation is the task of predicting a pixel-wise depth map from a single image to understand the 3D geometry of a scene. The distance from a scene point to the camera provides essential information in various applications, including 2D-to-3D image/video conversion [52], augmented reality [35], autonomous driving [8], surveillance [22], and 3D CAD model generation [20]. Since only a single camera is available in many applications, monocular depth estimation, which infers the 3D information of a scene without additional equipment, has become an important research topic.

Recently, learning-based monocular depth estimators using convolutional neural networks (CNNs) have shown significant performance improvements, overcoming the intrinsic ill-posedness of monocular depth estimation by exploiting a huge amount of training data [1, 3, 6, 7, 14, 25, 28, 54, 56]. Existing learning-based monocular depth estimators can be classified into two categories according to the properties of estimated depth maps: relative depth estimation and metric depth estimation. Relative depth estimation predicts the relative depth order among pixels [2, 34, 51, 58]. Metric depth estimation, on the other hand, predicts the absolute distance of each scene point from the camera [3, 6, 14, 25, 54], which is a pixel-wise regression problem.

Fig. 1. Metric depth maps and their normalized depth maps.

To estimate a metric depth map, a network should learn both the 3D geometry of the scene and the camera parameters. This implies that a metric depth estimator should be trained with a dataset obtained by a specific camera. In contrast, a relative depth estimator can be trained with heterogeneous datasets, e.g., disparity maps from stereo image pairs or even manually labeled pixel pairs. Thus, relative depth estimation is an easier task than metric depth estimation. Moreover, note that the geometry of a scene can be estimated more easily when extra cues are available. For example, depth completion [38, 39, 55], which recovers a dense depth map from sparse depth measurements, can be performed more accurately and more reliably than monocular depth estimation. Based on these observations, metric depth estimation algorithms using relative depths as extra cues have been developed via fitting [34, 43] or fine-tuning [42].

In this paper, we propose a monocular metric depth estimator that decomposes a metric depth map into a normalized depth map and scale features. As illustrated in Fig. 1, a normalized depth map contains relative depth information, and it is less sensitive to scale variations or camera parameters than a metric depth map is. The proposed algorithm consists of a single shared encoder and three decoders, G-Net, N-Net, and M-Net, which estimate gradient maps, a normalized depth map, and a metric depth map, respectively. M-Net learns to estimate metric depth maps using relative depth features extracted by G-Net and N-Net. To this end, we progressively transfer features from G-Net to N-Net and then from N-Net to M-Net. In addition, we develop the mean depth residual (MDR) block for M-Net to utilize N-Net features more effectively. Because the proposed algorithm learns to estimate metric depths by exploiting gradient maps and relative depths, additional datasets containing only relative depths can be used to improve the metric depth estimation performance further. Experimental results show that the proposed algorithm is competitive with state-of-the-art metric depth estimators, even when it is trained with a smaller metric depth dataset.

This paper has the following contributions:

  • We propose a novel monocular depth estimator, which decomposes a metric depth map into a normalized depth map and relative depth features and then exploits those relative depth features to improve the metric depth estimation performance.

  • The proposed algorithm can be adapted to a new camera efficiently since it can be trained with a small metric depth dataset together with camera-independent relative depth datasets.

  • The proposed algorithm provides performance competitive with conventional state-of-the-art metric depth estimators and can improve the performance further through joint training using multiple datasets.

2 Related Work

2.1 Monocular Metric Depth Estimation

The objective of monocular metric depth estimation is to predict pixel-wise absolute distances of a scene from a camera using a single image. Since different 3D scenes can be projected onto the same 2D image, monocular depth estimation is ill-posed. Nevertheless, active research has been conducted due to its practical importance. To infer depths, early approaches made prior assumptions on scenes, e.g., box blocks [9], planar regions [44], or particular layouts of objects [10]. However, they may provide implausible results, especially in regions with ambiguous colors or small objects.

With recent advances in deep learning, CNN techniques for monocular depth estimation have been developed, yielding excellent performance. Many attempts have been made to find better network architectures [3, 6, 14, 25, 54] or to design more effective loss functions [2, 5, 16, 25]. It has also been demonstrated that the depth estimation performance can be improved by predicting quantized depths through ordinal regression [7], by employing Fourier domain analysis [26], by enforcing geometric constraints of virtual normals [56], or by reweighting multiple loss functions [28]. Recently, the vision transformer [4] was employed for monocular depth estimation [1], improving the performance significantly.

2.2 Relative Depth Estimation

The objective of relative depth estimation is to learn the pairwise depth order [58] or the rank of pixels [2, 51] in an image. Recently, listwise ranking, instead of pairwise ranking, was considered for relative depth estimation [34]. Also, scale-invariant loss [6] and its variants [32, 33, 43, 48] have been used to alleviate the scale ambiguity of depths, thereby improving the performance of relative depth estimation.

Unlike metric depths, relative depth information—or depth order information—is invariant to camera parameters. Therefore, even when a training set is composed of images captured by different cameras, the camera differences do not adversely affect the performance of relative depth estimation. Hence, heterogeneous training data, such as disparity maps from stereo image pairs [48, 50, 51] or videos [43], structure-from-motion reconstructions [32, 33], and ordinal labels [2], have been used to train relative depth estimators.

2.3 Relative vs. Metric Depths

A metric depth map contains relative depth information, whereas relative depth information is not sufficient for reconstructing a metric depth map. However, relative-to-metric depth conversion has been attempted by fitting relative depths to metric depths [34, 43] or by fine-tuning a relative depth estimator for metric depth estimation [42].

On the other hand, relative and metric depths can be jointly estimated to exploit their correlation and to eventually improve the performance of metric depth estimation. To this end, ordinal labels are used with a ranking loss in [2]. Also, in [27], relative and metric depth maps at various scales are first estimated and then optimally combined to yield a final metric depth map.

The proposed algorithm also estimates relative depth information, in addition to metric depths, to improve the performance of metric depth estimation. However, unlike [2, 27], the proposed algorithm better exploits the correlation between relative and metric depths by decomposing a metric depth map. Furthermore, the proposed algorithm can provide promising results even with a small metric depth dataset by additionally exploiting a relative depth dataset.

Fig. 2. (a) Overall network architecture of the proposed algorithm and (b) detailed structure of the decoders. The proposed algorithm consists of a shared encoder and three decoders: G-Net, N-Net, and M-Net. G-Net predicts horizontal and vertical gradient maps, while N-Net and M-Net estimate normalized and metric depth maps, respectively. Note that G-Net features are fed into N-Net, and N-Net features are fed into M-Net.

3 Proposed Algorithm

Figure 2 is an overview of the proposed algorithm, which consists of a shared encoder and three decoders—G-Net, N-Net, and M-Net. The shared encoder extracts common features that are fed into the three decoders. Then, G-Net predicts horizontal and vertical gradients of depths, while N-Net and M-Net estimate a normalized depth map and a metric depth map, respectively. Note that features extracted by G-Net are fed into N-Net to convey edge information, and those by N-Net are, in turn, fed into M-Net to provide relative depth features. Finally, via the MDR block, M-Net exploits the relative depth features to estimate a metric depth map more accurately.

3.1 Metric Depth Decomposition

Given an RGB image \(I \in \mathbb {R}^{h \times w \times 3}\), the objective is to estimate a metric depth map \(M \in \mathbb {R}^{h \times w}\). However, this is ill-posed because different scenes with different metric depths can be projected onto the same image. Moreover, scale features of depths are hard to estimate from the color information only since they also depend on the camera parameters. To address this issue, we decompose a metric depth map M into a normalized depth map N and scale parameters. The normalized depth map N contains relative depth information, so it is less sensitive to scale variations or camera parameters than the metric depth map M is.

There are several design choices for normalizing a metric depth map, including min-max normalization and ranking-based normalization [11]. However, min-max normalization is sensitive to outliers, and ranking-based normalization is unreliable in areas with homogeneous depths, such as walls and floors. Instead, we normalize a metric depth map using z-score normalization. Given a metric depth map M, we obtain the normalized depth map N by

$$\begin{aligned} N = \frac{M - \mu _{M} U}{\sigma _{M}} \end{aligned}$$
(1)

where \(\mu _M\) and \(\sigma _M\), respectively, denote the mean and standard deviation of metric depths in M. Also, U is the all-ones matrix, i.e., the matrix whose elements are all equal to 1.
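As an illustration, the normalization in (1) and its inverse can be sketched as follows. This is a minimal NumPy sketch; the validity mask for pixels without ground-truth depth is our assumption.

```python
import numpy as np

def normalize_depth(M, valid=None):
    """Z-score normalization of a metric depth map M, as in (1).

    `valid` is an optional boolean mask of pixels with reliable depth
    (an assumption); the statistics are computed over those pixels only.
    Returns the normalized map N and the scale parameters (mu, sigma).
    """
    if valid is None:
        valid = np.ones(M.shape, dtype=bool)
    mu = M[valid].mean()
    sigma = M[valid].std()
    return (M - mu) / sigma, mu, sigma

def denormalize_depth(N, mu, sigma):
    """Recover the metric depth map from N when mu and sigma are known."""
    return sigma * N + mu

# toy example
M = np.random.uniform(0.5, 10.0, size=(384, 512))
N, mu, sigma = normalize_depth(M)
M_rec = denormalize_depth(N, mu, sigma)
assert np.allclose(M, M_rec)
```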

N-Net, denoted by \(f_N\), estimates this normalized depth map, and its estimate is denoted by \(\hat{N}\). When the scale parameters \(\mu _M\) and \(\sigma _M\) are known, the metric depth map M can be reconstructed by

$$\begin{aligned} \hat{M}_{\text {direct}} = \sigma _{M} \hat{N} + \mu _{M} U. \end{aligned}$$
(2)

In practice, \(\mu _M\) and \(\sigma _M\) are unknown. Conventional methods in [34, 42, 43] obtain fixed \(\sigma _{M}\) and \(\mu _{M}\) for all images based on the least-squares criterion. In such a case, the accuracy of \(\hat{M}_{\text {direct}}\) in (2) greatly depends on the accuracy of \(\sigma _{M}\) and \(\mu _{M}\). In this work, instead of the direct conversion in (2), we estimate the metric depth map by employing the features \(\psi _N(I)\), which are extracted by the proposed N-Net, \(f_N\), during the estimation of \(\hat{N}\). In other words, the proposed M-Net, \(f_M\), estimates the metric depth map by

$$\begin{aligned} \hat{M} = f_M(I, \psi _N(I)). \end{aligned}$$
(3)

For metric depth estimation, structural data (e.g. surface normals or segmentation maps) have been adopted as additional cues [5, 29, 37, 40], or relative depth features have been used indirectly via loss functions (e.g. pairwise ranking loss [2] or scale-invariant loss [1, 6, 30]). In contrast, we convert a metric depth map to a normalized depth map. The proposed N-Net then estimates the normalized depth map to extract the features \(\psi _N\), which contain relative depth information, and the proposed M-Net uses \(\psi _N\) for effective metric depth estimation.

Similarly, we further decompose the normalized depth map N into more elementary data: horizontal and vertical gradients. The horizontal gradient map \(G_x\) is given by

$$\begin{aligned} G_{x} = \nabla _x N \end{aligned}$$
(4)

where \(\nabla _x\) is the partial derivative operator computing the differences between horizontally adjacent pixels. The vertical gradient map \(G_y\) is obtained similarly. The proposed G-Net is trained to estimate these gradient maps \(G_x\) and \(G_y\). Hence, G-Net learns edge information in a scene, and its features \(\psi _G\) are useful for inferring the normalized depth map. Therefore, similar to (3), N-Net estimates the normalized depth map via

$$\begin{aligned} \hat{N} = f_N(I, \psi _G(I)) \end{aligned}$$
(5)

using the gradient features \(\psi _G(I)\).
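For reference, the gradient maps in (4) can be computed with simple finite differences. The sketch below is a NumPy illustration; zero-padding the last column and row so that the gradient maps keep the input resolution is our assumption.

```python
import numpy as np

def depth_gradients(N):
    """Horizontal and vertical gradient maps of a normalized depth map N.

    G_x holds differences between horizontally adjacent pixels and G_y
    between vertically adjacent ones; the last column/row is zero-padded
    so that both maps keep the size of N (an assumption).
    """
    G_x = np.zeros_like(N)
    G_y = np.zeros_like(N)
    G_x[:, :-1] = N[:, 1:] - N[:, :-1]
    G_y[:-1, :] = N[1:, :] - N[:-1, :]
    return G_x, G_y
```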

3.2 Network Architecture

For the shared encoder in Fig. 2, we adopt EfficientNet-B5 [47] as the backbone network. G-Net and N-Net have an identical structure, consisting of five upsampling blocks. However, G-Net outputs two channels for two gradient maps \(G_x\) and \(G_y\), while N-Net yields a single channel for a normalized depth map N. M-Net also has a similar structure, except for the MDR block, which will be detailed in Sect. 3.3. MDR predicts the mean \(\mu _M\) of M separately, which is added back at the end of M-Net.

The encoder features are fed into the three decoders via skip-connections [13], as shown in Fig. 2(a). To differentiate the encoder features for the different decoders, we apply \(1 \times 1\) convolution to the encoder features before feeding them to each decoder. Also, we apply the channel attention [15] before each skip-connection to each decoder.

We transfer features unidirectionally from G-Net to N-Net and also from N-Net to M-Net to exploit low-level features for the estimation of high-level data. To this end, we fuse features through element-wise addition before each of the first four upsampling blocks in N-Net and M-Net, as shown in Fig. 2(b). Specifically, let \(\psi _G^\textrm{out}\) and \(\psi _N^\textrm{out}\) denote the output features of G-Net and N-Net, respectively. Then, the input feature \(\psi ^\textrm{in}_N\) to the next layer of N-Net is given by

$$\begin{aligned} \psi ^\textrm{in}_N = \omega _G \psi _G^\textrm{out} + \omega _N \psi _N^\textrm{out} \end{aligned}$$
(6)

where \(\omega _G\) and \(\omega _N\) are pre-defined weights for \(\psi _G^\textrm{out}\) and \(\psi _N^\textrm{out}\) to control the relative contributions of the two features. For M-Net, the features from N-Net are added similarly. In order to fuse features, we use addition, instead of multiplication or concatenation, for computational efficiency.
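A minimal sketch of the fusion in (6), assuming PyTorch feature tensors of identical shape (the tensor sizes below are placeholders):

```python
import torch

def fuse_features(feat_a, feat_b, w_a=1.0, w_b=1.0):
    """Element-wise weighted addition of two feature maps, as in (6)."""
    return w_a * feat_a + w_b * feat_b

# e.g., fusing G-Net output features into N-Net before an upsampling block
psi_G_out = torch.randn(1, 128, 48, 64)   # assumed shape
psi_N_out = torch.randn(1, 128, 48, 64)
psi_N_in = fuse_features(psi_G_out, psi_N_out)   # omega_G = omega_N = 1
```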

Fig. 3. Detailed structure of the MDR block.

3.3 MDR Block

We design the MDR block to utilize the features \(\psi _N\) of N-Net more effectively for the metric depth estimation in M-Net. Figure 3 shows the structure of the MDR block, which applies patchwise attention to an input feature map and estimates the mean \(\mu _M\) of M separately using the transformer encoder [4]. Note that the transformer architecture enables us to use one of the patchwise-attended feature vectors to regress \(\mu _M\).

Specifically, MDR first processes the input features with two convolution layers in parallel: an \(8 \times 8\) convolution layer with a stride of 8 and a \(3 \times 3\) convolution layer. The patchwise output of the \(8 \times 8\) convolution is added to the positional encodings and then input to the transformer encoder [4]. The positional encodings are learnable parameters, randomly initialized before training. Then, the transformer encoder generates 192 patchwise-attended feature vectors of 128 dimensions. We adopt the mini-ViT architecture [1] for the transformer encoder. The first vector is fed to the regression module, composed of three fully-connected layers, to yield \(\mu _M\). The remaining 191 vectors form a matrix, which is multiplied with the output of the \(3 \times 3\) convolution layer and reshaped to generate \(191 \times 96 \times 128\) output features. Finally, those output features are fed to the next upsampling block of M-Net. Also, the estimated \(\mu _M\) is added back at the end of M-Net in Fig. 2(b), which makes the remaining parts of M-Net focus on the estimation of the mean-removed depth map \(M-\mu _M U\) by exploiting the N-Net features \(\psi _N\).
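The following simplified PyTorch sketch illustrates how such a block could be organized. The channel counts, the number of transformer layers and heads, and the sizes of the regression head are assumptions; only the overall data flow follows the description above.

```python
import torch
import torch.nn as nn

class MDRBlock(nn.Module):
    """Simplified MDR block sketch: patchwise attention plus mean regression."""
    def __init__(self, in_ch=128, embed_dim=128, n_tokens=192):
        super().__init__()
        self.patchify = nn.Conv2d(in_ch, embed_dim, kernel_size=8, stride=8)
        self.conv3x3 = nn.Conv2d(in_ch, embed_dim, kernel_size=3, padding=1)
        self.pos_emb = nn.Parameter(torch.randn(n_tokens, embed_dim))  # learnable
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           dim_feedforward=512, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.mu_head = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 256), nn.ReLU(),
                                     nn.Linear(256, 1))  # three FC layers

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.patchify(x).flatten(2).transpose(1, 2)     # B x P x D
        tokens = self.transformer(tokens + self.pos_emb[: tokens.size(1)])
        mu = self.mu_head(tokens[:, 0])                          # mean depth
        kernels = tokens[:, 1:]                                  # B x (P-1) x D
        feat = self.conv3x3(x).flatten(2)                        # B x D x HW
        out = torch.bmm(kernels, feat).view(b, -1, h, w)         # B x (P-1) x H x W
        return out, mu

block = MDRBlock()
feat = torch.randn(1, 128, 96, 128)   # assumed decoder feature size -> 192 patches
out, mu = block(feat)                 # out: 1 x 191 x 96 x 128, mu: 1 x 1
```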

3.4 Loss Functions

Let us describe the loss functions for training the three decoders. For G-Net, we use the \(\ell _1\) loss

$$\begin{aligned} \mathcal {L}_G = \frac{1}{T}\big (\Vert \hat{G}_x - G_x\Vert _1 + \Vert \hat{G}_y - G_y\Vert _1\big ) \end{aligned}$$
(7)

where \(\hat{G}_x\) and \(\hat{G}_y\) are predictions of the ground-truth gradient maps \(G_x\) and \(G_y\), respectively. Also, T denotes the number of valid pixels in the ground-truth.

For N-Net, we use two loss terms: the \(\ell _1\) loss and the gradient loss. The \(\ell _1\) loss is defined as

$$\begin{aligned} \mathcal {L}_N = \frac{1}{T}\Vert \hat{N} - N\Vert _1 \end{aligned}$$
(8)

where \(\hat{N}\) and N are predicted and ground-truth normalized depth maps. Note that scale-invariant terms are often adopted to train monocular depth estimators [1, 30, 42]. However, we do not use such scale-invariant losses since normalized depth maps are already scale-invariant. Next, the gradient losses [16, 33, 51] for N in the horizontal direction are defined as

$$\begin{aligned} \mathcal {L}_{Nx} = \frac{1}{T \cdot s^2}\Vert \nabla _x\hat{N}_s - \nabla _x N_s\Vert _1 \end{aligned}$$
(9)

where \(\hat{N}_s\) and \(N_s\) are the bilinearly scaled \(\hat{N}\) and N with a scaling factor s. We compute the gradient losses at various scales, as in [33, 51], by setting s to 0.5, 0.25, and 0.125. The losses \(\mathcal {L}_{Ny}\) in the vertical direction are also used.
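As a sketch, the losses in (8) and (9) could be implemented as follows. The validity mask and the use of bilinear interpolation for rescaling are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def masked_l1(pred, target, valid):
    """Mean absolute error over valid pixels (T = number of valid pixels)."""
    return (pred - target).abs()[valid].sum() / valid.sum().clamp(min=1)

def loss_N(pred_N, gt_N, valid):
    """L_N in (8)."""
    return masked_l1(pred_N, gt_N, valid)

def loss_grad(pred_N, gt_N, valid, scales=(0.5, 0.25, 0.125)):
    """Multi-scale gradient losses in the horizontal and vertical directions."""
    total = 0.0
    for s in scales:
        p = F.interpolate(pred_N, scale_factor=s, mode='bilinear', align_corners=False)
        g = F.interpolate(gt_N, scale_factor=s, mode='bilinear', align_corners=False)
        v = F.interpolate(valid.float(), scale_factor=s, mode='bilinear',
                          align_corners=False) > 0.5
        # horizontal (L_Nx) and vertical (L_Ny) terms at scale s
        total = total + masked_l1(p[..., :, 1:] - p[..., :, :-1],
                                  g[..., :, 1:] - g[..., :, :-1],
                                  v[..., :, 1:] & v[..., :, :-1])
        total = total + masked_l1(p[..., 1:, :] - p[..., :-1, :],
                                  g[..., 1:, :] - g[..., :-1, :],
                                  v[..., 1:, :] & v[..., :-1, :])
    return total
```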

Similarly, for M-Net, we use the loss terms \(\mathcal {L}_M\), \(\mathcal {L}_{Mx}\), and \(\mathcal {L}_{My}\). In addition, we use two more loss terms. First, to train the MDR block, we define

$$\begin{aligned} \mathcal {L}_{\mu _M} = \Vert \mu (\hat{M}) - \mu _M\Vert _1 \end{aligned}$$
(10)

where \(\mu (\hat{M})\) denotes the mean of depth values in \(\hat{M}\). Second, we define the logarithmic \(\ell _1\) loss,

$$\begin{aligned} \mathcal {L}_{\log M} = \frac{1}{T}\Vert \log {\hat{M}} - \log {M}\Vert _1. \end{aligned}$$
(11)

In this work, we adopt the inverse depth representation of metric depths to match the depth order with a relative depth dataset [51]. In this case, theoretically, a metric depth can have a value in the range \([0, \infty )\). Thus, when a metric depth is near zero, its inverted value becomes too large, which interferes with training. We overcome this problem through a simple modification. Given an original metric depth \(m_o\), its inverted metric depth m is defined as

$$\begin{aligned} m = 1 / (m_o + 1). \end{aligned}$$
(12)

In this way, inverted metric depth values are within the range of (0, 1] and also are more evenly distributed.
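A minimal sketch of the conversion in (12) and its inverse, assuming original metric depths \(m_o \ge 0\):

```python
def invert_depth(m_o):
    """Map an original metric depth m_o in [0, inf) to m in (0, 1], as in (12)."""
    return 1.0 / (m_o + 1.0)

def revert_depth(m):
    """Recover the original metric depth from the inverted representation."""
    return 1.0 / m - 1.0
```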

However, using the \(\ell _1\) loss \(\mathcal {L}_M\) on inverse depths has a disadvantage in learning distant depths. Suppose that \(\hat{\chi }\) and \(\chi \) are predicted and ground-truth metric depth values for a pixel, respectively. Then, the \(\ell _1\) error E is given by

$$\begin{aligned} E = \bigg | \frac{1}{\hat{\chi }} - \frac{1}{\chi }\bigg |. \end{aligned}$$
(13)

As \(\chi \) gets larger, E becomes smaller for the same \(| \hat{\chi } - \chi |\). This means that the network is trained less effectively for distant regions. This problem is alleviated by employing \(\mathcal {L}_{\log M}\) in (11).
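For instance, with the same absolute error \(|\hat{\chi } - \chi | = 0.5\), a nearby point with \(\chi = 1\) and \(\hat{\chi } = 1.5\) yields \(E \approx 0.33\), whereas a distant point with \(\chi = 10\) and \(\hat{\chi } = 10.5\) yields only \(E \approx 0.0048\), about 70 times smaller.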

4 Experimental Results

4.1 Datasets

We use four depth datasets: one for relative depths [51] and three for metric depths [23, 45, 46]. When relative depth data are used in training, losses are generated from the loss terms for N-Net and G-Net only because the loss terms for M-Net cannot be computed.

HR-WSI [51]: It consists of 20,378 training and 400 test images. The ground-truth disparity maps are generated by FlowNet 2.0 [19]. We use only the training data of HR-WSI. We normalize the disparity maps by (1) and regard them as normalized depth maps.

NYUv2 [45]: It contains 120K video frames for training and 654 frames for test, together with the depth maps captured by a Kinect v1 camera. We use the NYUv2 dataset for both training and evaluation. We construct three training sets containing 51K, 17K, and 795 images, respectively. Specifically, we extract the 51K and 17K images by sampling video frames uniformly. For the 795 images, we use the official training split of NYUv2. We fill in missing depths using the colorization scheme [31], as in [45].

DIML-Indoor [23]: It consists of 1,609 training images and 503 test images, captured by a Kinect v2 camera.

SUN RGB-D [46]: It consists of 5,285 training images and 5,050 test images, obtained by four different cameras: Kinect v1, Kinect v2, RealSense, and Xtion.

4.2 Evaluation Metrics

Metric Depths: We adopt the four evaluation metrics in [6], listed below. Here, \(M_i\) and \(\hat{M}_i\) denote the ground-truth and predicted depths of pixel i, respectively. \(|\cdot |\) denotes the number of valid pixels in a depth map. For the NYUv2 dataset, we adopt the center crop protocol [6].

$$\begin{aligned} \textstyle \text {RMSE}:&\ \ \frac{1}{|M|}\big (\sum _{i}(\hat{M}_i-M_i)^2\big )^{0.5} \\ \text {REL}:&\ \ \frac{1}{|M|}\sum _{i} \vert \hat{M}_i-M_i \vert /M_i \\ \log 10:&\ \ \frac{1}{|M|}\sum _{i} \vert \log _{10}(\hat{M}_i)-\log _{10}(M_i) \vert \\ \delta _k:&\ \ \% \text { of } M_i \text { that satisfies} \ \max \!\left( \frac{\hat{M}_i}{M_i},\frac{M_i}{\hat{M}_i}\right) \!<1.25^k, \ \ k \in \{1, 2, 3\} \end{aligned}$$
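For clarity, these metrics can be computed over the valid pixels as in the following NumPy sketch; the validity mask is our assumption.

```python
import numpy as np

def depth_metrics(pred, gt, valid):
    """RMSE, REL, log10, and threshold accuracies delta_1, delta_2, delta_3."""
    p, g = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    rel = np.mean(np.abs(p - g) / g)
    log10 = np.mean(np.abs(np.log10(p) - np.log10(g)))
    ratio = np.maximum(p / g, g / p)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return rmse, rel, log10, deltas
```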

Relative Depths: We use two metrics for relative depths. First, WHDR (weighted human disagreement rate) [2, 43, 51] measures the ordinal consistency between point pairs. We follow the evaluation protocol of [51] to randomly sample 50,000 pairs in each depth map. However, because of this random sampling, WHDR fluctuates with each measurement. Hence, we additionally use Kendall’s \(\tau \) [21], which considers the ordering relations of all pixel pairs. Given a ground-truth normalized depth map D and its prediction \(\hat{D}\), Kendall’s \(\tau \) is defined as

$$\begin{aligned} \tau {(\hat{D}, D)} = \frac{\alpha (\hat{D}, D) - \beta (\hat{D}, D)}{\left( {\begin{array}{c}|D|\\ 2\end{array}}\right) } \end{aligned}$$
(14)

where \(\alpha (\hat{D}, D)\) and \(\beta (\hat{D}, D)\) are the numbers of concordant pairs and discordant pairs between D and \(\hat{D}\), respectively. Note that Kendall’s \(\tau \) can measure the quality of a metric depth map, as well as that of a relative one.
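In practice, Kendall's \(\tau \) can be computed with an off-the-shelf routine, as in the sketch below. Note that scipy.stats.kendalltau returns the tie-corrected \(\tau \)-b variant by default, whereas (14) is the uncorrected form, so small differences can arise when depth values are tied.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_tau(pred_D, gt_D, valid):
    """Ordinal agreement between a predicted and a ground-truth depth map."""
    tau, _ = kendalltau(np.asarray(pred_D)[valid], np.asarray(gt_D)[valid])
    return tau
```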

Table 1. Comparison of depth estimation results on the NYUv2 dataset. ‘#’ is the number of training images, and \(\dagger \) means that additional data are used for training. The best results are boldfaced. Lower RMSE, REL, and \(\log 10\) values indicate better results, while higher \(\delta _k\) values are better.
Fig. 4. Qualitative comparison of the proposed algorithm with the conventional algorithms. For each depth map, the error map is also provided, in which brighter pixels correspond to larger errors.

4.3 Implementation Details

Network Architecture: We employ EfficientNet-B5 [47] as the encoder backbone. The encoder takes a \(512 \times 384\) RGB image and generates a \(16 \times 12\) feature map with 2,048 channels. The output feature is used as the input to the three decoders. G-Net and N-Net consist of 5 upsampling blocks, each of which is composed of a bilinear interpolation layer and two \(3\times 3\) convolution layers with ReLU activation. Also, in addition to the 5 upsampling blocks, M-Net includes the MDR block, located between the fourth and fifth upsampling blocks. For the feature fusion in (6), we set \(\omega _G = \omega _N = 1\).

Training: We train the proposed algorithm in two phases. First, we train the network without M-Net for 20 epochs with an initial learning rate of \(10^{-4}\). The learning rate is decreased by a factor of 0.1 at every fifth epoch. Second, we train the entire network, including all three decoders, jointly for 15 epochs with an initial learning rate of \(10^{-4}\), which is decreased by a factor of 0.1 at every third epoch. We use the Adam optimizer [24] with a weight decay of \(10^{-4}\). If relative depth data are used in the second phase, losses are calculated from the loss terms for N-Net and G-Net only.
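The two learning-rate schedules correspond to simple step decay, as in the following PyTorch sketch; the placeholder modules stand in for the actual networks and are assumptions.

```python
import torch
import torch.nn as nn

# placeholder modules standing in for the actual networks (assumptions)
model_phase1 = nn.Linear(8, 8)   # encoder + G-Net + N-Net
model_full = nn.Linear(8, 8)     # entire network including M-Net

# phase 1: 20 epochs, learning rate decayed by 0.1 every 5 epochs
opt1 = torch.optim.Adam(model_phase1.parameters(), lr=1e-4, weight_decay=1e-4)
sched1 = torch.optim.lr_scheduler.StepLR(opt1, step_size=5, gamma=0.1)

# phase 2: 15 epochs, learning rate decayed by 0.1 every 3 epochs
opt2 = torch.optim.Adam(model_full.parameters(), lr=1e-4, weight_decay=1e-4)
sched2 = torch.optim.lr_scheduler.StepLR(opt2, step_size=3, gamma=0.1)
# sched1.step() / sched2.step() are called once per epoch
```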

4.4 Performance Comparison

Table 1 compares the proposed algorithm with conventional ones on the NYUv2 dataset. Some of the conventional algorithms use only the NYUv2 training data [1, 3, 6, 7, 12, 15, 18, 25, 28, 30, 56], while the others use extra data [41, 42, 49]. For fair comparisons, we train the proposed algorithm in both ways: ‘Proposed’ uses NYUv2 only, while ‘Proposed\(^\dagger \)’ uses both HR-WSI and NYUv2. The following observations can be made from Table 1.

Table 2. Comparison of depth estimation results on various datasets. ‘#’ is the size of the metric depth dataset, and \(\dagger \) means that 20K HR-WSI data are additionally used for training. A higher Kendall’s \(\tau \) indicates a better result.
Fig. 5. Qualitative comparison of the proposed algorithm with the baseline using the NYUv2 (795), DIML-Indoor, and SUN RGB-D datasets.

  • ‘Proposed’ outperforms all conventional algorithms in all metrics with no exception. For example, ‘Proposed’ provides a REL score of 0.100, which is 0.003 better than that of the second-best algorithm, Bhat et al. [1]. Note that both algorithms use the same encoder backbone of EfficientNet-B5 [47].

  • ‘Proposed\(^\dagger \)’ provides the best results in five out of six metrics. For \(\delta _2\), the proposed algorithm yields the second-best score after Ranftl et al. [42]. It is worth pointing out that Ranftl et al. uses about 20 times more training data than the proposed algorithm does.

Figure 4 compares the proposed algorithm with the conventional algorithms [1, 3, 6, 7, 30, 42] qualitatively. We see that the proposed algorithm estimates the depth maps more faithfully with smaller errors.

Table 3. Ablation studies of the proposed algorithm using the NYUv2 (17K) dataset.

4.5 Various Datasets

Table 2 verifies the effectiveness of the proposed algorithm on various datasets. The first two columns in Table 2 indicate the metric depth dataset and its size. ‘Baseline’ is a bare encoder-decoder for monocular depth estimation. Specifically, we remove G-Net and N-Net, as well as the MDR block in M-Net, from the proposed algorithm in Fig. 2 to construct ‘Baseline.’ For its training, only three loss terms \(\mathcal {L}_M\), \(\mathcal {L}_{Mx}\), and \(\mathcal {L}_{My}\) are used. ‘Proposed’ denotes the proposed algorithm trained without the 20K HR-WSI training data, while ‘Proposed\(^\dagger \)’ additionally uses the HR-WSI data. The following observations can be made from Table 2.

  • By comparing ‘Proposed’ with ‘Baseline,’ we see that G-Net and N-Net help M-Net improve the performance of metric depth estimation by transferring edge information and relative depth information. Also, ‘Proposed\(^\dagger \)’ meaningfully outperforms ‘Proposed’ by leveraging relative depth training data in HR-WSI, which contain no metric depth labels.

  • Even when only the 795 NYUv2 images are used, the proposed algorithm provides acceptable results. For example, its RMSE score of 0.417 is similar to that of Huynh et al.’s estimator [18] in Table 1, which uses 50K metric depth maps. In contrast, the proposed algorithm uses only the 795 metric depth maps.

  • The proposed algorithm also exhibits similar trends in the DIML-Indoor and SUN RGB-D datasets, which are collected using different cameras: the proposed algorithm can be trained effectively even with a small number of metric depth images. This is advantageous in practical applications in which an algorithm should be adapted for various cameras.

Figure 5 compares ‘Baseline’ and ‘Proposed\(^\dagger \)’ qualitatively using the 795 NYUv2, DIML-Indoor, and SUN RGB-D datasets. For all datasets, ‘Proposed\(^\dagger \)’ provides more accurate and more detailed depth maps, especially around chairs, tables, and desks, than ‘Baseline’ does.

4.6 Analysis

Table 4. Effectiveness of the two-phase training scheme.
Table 5. Complexity comparison.
Table 6. Complexity of each component.

Ablation Studies: Table 3 presents ablation studies in which the proposed components are added one by one. Here, the 17K images from NYUv2 are used for training. M, N, and G denote the three decoders. MDR\(^*\) is the MDR block with \(\mu _M\) deactivated. \(\dagger \) indicates the use of relative depth data. We see that all components lead to performance improvements, especially in terms of the two relative depth metrics, Kendall’s \(\tau \) and WHDR.

Table 4 shows the effectiveness of the two-phase training scheme of the proposed algorithm. The proposed algorithm, which trains G-Net and N-Net first, shows better results than the single-phase scheme, which trains the entire network at once.

Complexities and Inference Speeds: Table 5 compares the complexities of the proposed algorithm and the algorithm of Ranftl et al. [42]. The proposed algorithm runs faster with fewer parameters than the algorithm of Ranftl et al. [42]. This indicates that the performance gain of the proposed algorithm comes not from an increase in complexity but from the effective use of relative depth features. Table 6 lists the complexity of each component of the proposed algorithm. The encoder spends most of the inference time, while the three decoders are relatively fast.

5 Conclusions

We proposed a monocular depth estimator that decomposes a metric depth map into a normalized depth map and scale features. The proposed algorithm is composed of a shared encoder with three decoders, G-Net, N-Net, and M-Net, which estimate gradient maps, a normalized depth map, and a metric depth map, respectively. G-Net features are used in N-Net, and N-Net features are used in M-Net. Moreover, we developed the MDR block for M-Net to utilize N-Net features and improve the metric depth estimation performance. Extensive experiments demonstrated that the proposed algorithm provides competitive performance and yields acceptable results even with a small metric depth dataset.