1 Introduction

Given a single input RGB image, monocular depth estimation (MDE) (Zhao et al., 2020; Ming et al., 2021; Masoumian et al., 2022) aims at estimating the corresponding depth map. As a fundamental yet challenging task in computer vision, MDE has various downstream applications such as autonomous driving (Geiger et al., 2013; Cordts et al., 2016), visual odometry (Zhan et al., 2018), special effects (Luo et al., 2020), and 3D reconstruction (Kopf et al., 2021; Xu et al., 2023; Yin et al., 2023); application scenarios span from indoor (Ji et al., 2021; Li et al., 2021; Wu et al., 2022) to outdoor (Vyas et al., 2022). Recently, the task has been greatly advanced thanks to the development of deep neural network architectures, evolving from convolutional neural networks (Alhashim & Wonka, 2018; Lee et al., 2019) to vision transformers (Ranftl et al., 2021; Yuan et al., 2022).

However, the success of current deep learning-based MDE methods is generally limited to a single dataset due to the significant domain shifts across different datasets. For example, the KITTI dataset (Geiger et al., 2013) and the VIPER dataset (Richter et al., 2017) concentrate on real-world urban driving scenarios, while the Sintel dataset (Butler et al., 2012) contains synthetic movies. Therefore, methods that are state-of-the-art (SOTA) on one dataset often cannot achieve comparable performance on another dataset without substantial adaptation. In practice, deep MDE network models can overfit to scene contents, focal lengths, image sizes or depth sources (Facil et al., 2019).

Towards real-world applications across diverse scenes, a robust MDE model should generalize well across different scenarios without adaptation. Thus, we need to push methods to be robust and perform well across different datasets with fixed model parameters and hyperparameters. A unified network that solves more real-world monocular depth estimation problems is of high practical value and is urgently needed.

Fig. 1 Performance comparison between our method and SOTA methods on three datasets in terms of the SILog metric. Our method “Megatron_RVC” consistently outperforms all the competing methods and wins the challenge

Recently, robust MDE has gained a lot of attention. A robust MDE model is applicable in diverse situations and provides out-of-the-box MDE capability for the community. Existing methods mainly seek solutions in vision backbones (Ranftl et al., 2021) or diverse data collections, whether from large-scale web stereo data (Xian et al., 2018; Wang et al., 2019), human annotations (Chen et al., 2016) or Structure-from-Motion reconstructions (Li et al., 2018, 2019).

To foster the development of vision systems that are robust and consequently perform well on a variety of datasets with different characteristics, the Robust Vision Challenge (RVC) has been established (Fig. 1). The performance is measured across a number of challenging benchmarks with different characteristics, e.g., indoors versus outdoors, real versus synthetic, sunny versus bad weather, and different sensors (http://www.robustvision.net).

A robust monocular depth estimation model can not only be applied to diverse real-world applications and achieve stronger transfer ability when adapted to a specific domain, but can also empower various downstream depth-related tasks (Xu et al., 2023; Yin et al., 2023; Zhan et al., 2018; Luo et al., 2020; Wang et al., 2020). However, practices for training a robust MDE model have not been fully explored, mainly due to the difficulties in designing a unified network architecture and effective training strategies. Furthermore, given the massive amount of available open-sourced datasets, how to better exploit their potential is a problem that deserves in-depth study.

In this paper, we rethink the essential ingredients of a robust deep learning system and propose to tackle the task of robust and unified MDE from three perspectives: network architecture, training strategy and dataset. We name our method Megatron_RVC, an MDE model that performs consistently well on different benchmarks and is applicable in daily life. We explain the above three perspectives in detail.

First, we exploit current SOTA vision backbones and present a unified network architecture, which is not only of high capacity but also consists of components tailored for robust applications. We adopt a VQVAE module to provide content-aware embeddings. Furthermore, a convex upsampling module is utilized to improve the richness of the depth prediction details.

Second, we propose a novel multi-dataset mix training strategy called “Random Iterator Selection”, which supports the native resolution and tailored loss functions for each of the datasets used. This strategy can avoid data bias among multiple datasets and is of high efficiency, especially when training in parallel with multiple GPUs.

Third, we collect millions of publicly available samples from multiple sources to provide supervision for our model, where the depth maps come from depth sensors, stereo matching, multi-view reconstruction, synthetic rendering, or distillation from state-of-the-art MDE methods.

With such a large amount of data coupled with our mix training strategy, our unified network architecture with a large capacity of parameters can achieve robust performance across diverse scenarios. Our unified model ranked consistently first across three benchmarks (KITTI, MPI Sintel, and VIPER) among all participants and won the MDE track at the RVC challenge 2022.

Our main contributions are summarized as follows:

  1. We present a network architecture that contains components tailored for robust MDE: a VQVAE module provides content-aware embeddings, and a convex upsampling module improves the richness of details in the depth prediction.

  2. We propose a multi-dataset mix training strategy called “Random Iterator Selection”, which supports the native resolution and a tailored loss function for each dataset.

  3. We collect millions of publicly available samples from multiple sources to provide supervision for our model.

  4. Our method outperforms competing SOTA methods across different datasets and wins the monocular depth estimation track at the RVC challenge 2022.

2 Related Work

In this section, we briefly review related work on monocular depth estimation, large-scale depth datasets, and domain adaptation for robust MDE.

2.1 Monocular Depth Estimation

The task of Monocular Depth Estimation (MDE) (Zhao et al., 2020; Ming et al., 2021; Masoumian et al., 2022; Vyas et al., 2022) aims to predict the metric depth or relative distance given an input image. According to the source of supervision, monocular depth estimation can be roughly divided into three categories: supervised, self-supervised, and weakly supervised. The supervised MDE (Fu et al., 2018; Bhat et al., 2021; Yuan et al., 2022; Abdulwahab et al., 2022) methods directly learn the image-to-depth mapping from ground truth depth maps. Without needing ground truth depth, the self-supervised MDE methods (Masoumian et al., 2023; He et al., 2022; Zhou & Dong, 2022; Zhao et al., 2022a, b) learn to predict depth from left-right consistency (Godard et al., 2017) in stereo images or monocular video sequences (Zhou et al., 2017; Godard et al., 2019), where the supervision is achieved through view synthesis. The weakly-supervised MDE methods (Chen et al., 2016; Ren et al., 2020) learn the relative distance relationships from human annotations. In this paper, we confine our discussions to supervised MDE methods while the principle could be extended to other settings.

The supervised MDE methods are generally preferable to the self-supervised and weakly-supervised methods for the following reasons: (1) empirically, the supervised methods usually outperform the unsupervised and weakly-supervised ones in accuracy (Zhao et al., 2020); (2) theoretically, the supervised methods can make the model more robust in terms of achieving metric depth estimation and dealing with dynamic objects (Ming et al., 2021); (3) practically, the supervised methods are more parameter-effective and can incorporate training data from more diverse sources.

Recent years have witnessed tremendous progress in MDE, where various kinds of deep neural networks have been proposed. After the seminal work of Eigen et al. (2014) in utilizing deep neural networks for MDE, the ever-improving backbone networks have nourished many MDE models. For example, ResNet (He et al., 2016) has been exploited for MDE by Laina et al. (2016, 2017), while DenseNet has been exploited by DenseDepth (Alhashim & Wonka, 2018) and BTS (Lee et al., 2019). The network architecture EfficientNet (Tan et al., 2019) has been utilized by AdaBins (Bhat et al., 2021) to achieve accurate depth estimation. More recently, with the development of vision transformers, the accuracy of MDE methods has been further improved and remarkable performance has been achieved. The Vision Transformer backbone ViT (Dosovitskiy et al., 2021) has been extended to MDE by DPT (Ranftl et al., 2021), while the Swin transformer (Liu et al., 2021) backbone has been utilized in NeWCRFs (Yuan et al., 2022). The backbone network plays a significant role in the ever-increasing performance of state-of-the-art MDE methods. It is worth noting that most of the above success is limited to a single dataset, i.e., a different network model must be trained for each dataset separately.

Robust MDE is usually approached from the data perspective. MiDaS (Ranftl et al., 2020) utilizes nearly 2 million samples to train an MDE model that can produce scale-invariant inverse depth predictions. Such ability is further improved in Ranftl et al. (2021) by switching to a vision transformer (Dosovitskiy et al., 2021) as the backbone network. The Mannequin Challenge (Li et al., 2019) and MegaDepth (Li et al., 2018) are large-scale depth datasets reconstructed from internet videos and image collections using the Structure-from-Motion technique, which mainly focus on the depth of humans and buildings. ReDWeb (Xian et al., 2018) and WSVD (Wang et al., 2019) are two large-scale datasets consisting of stereo images and videos, from which ordinal relationships can be extracted. Some methods utilize large-scale internet photo and video collections, obtain the depth information through Structure-from-Motion (Li et al., 2019, 2018) or stereo matching (Xian et al., 2018; Wang et al., 2019), and train an MDE model towards diverse scenes. The pre-trained models of these methods provide the out-of-the-box capability of estimating depth from a single image, which greatly facilitates progress in the 3D vision field. However, among these works, the strategies for efficiently mixing multiple datasets are not fully explored, especially when facing datasets whose characteristics vary greatly.

2.2 Large Scale Depth Datasets

A robust MDE model is expected to provide consistently good predictions under different scenarios, thus a dataset that contains diverse scenes is helpful for building MDE models with high generalization ability. Most models trained on specific datasets are difficult to generalize to unconstrained scenarios due to the strong data bias (Torralba & Efros, 2011). Commonly used datasets target a single scenario or topic. For example, the NYU dataset (Silberman et al., 2012) includes 1,449 labeled indoor images, where the ground truth depth maps are captured by the Kinect sensor. The KITTI dataset (Geiger et al., 2013) concentrates on urban autonomous driving scenarios and mainly contains road scenes captured by cameras and a Lidar sensor mounted on the car. Similarly, Cityscapes (Cordts et al., 2016) consists of street scenes, and the depth information comes from stereo cameras. Make3D (Saxena et al., 2008) mainly consists of outdoor scenes of university campuses, and DIODE (Vasiljevic et al., 2019) consists of more indoor and outdoor data, using Lidar to obtain dense depth maps.

Large-scale datasets containing more diverse scenarios obtain data from the Internet. Chen et al. (2016) presented the Depth in the Wild (DIW) dataset consisting of 495k web images annotated with relative depth pairs. MegaDepth (Li et al., 2018) consists of 129K outdoor images collected from the Internet, where the depth maps are reconstructed using SfM (Schonberger & Frahm, 2016) techniques and are up to scale. ReDWeb (Xian et al., 2018), Holopix50k (Hua et al., 2020) and HRWSI (Xian et al., 2020) are stereo datasets containing more diverse daily scenes. Because these datasets do not provide ground truth depth captured by sensors, pre- and post-processing steps are usually needed to obtain reliable depth.

Restricted by imaging conditions, sensor application scopes and limited filming scenarios, real-world captured data has its limitations. Therefore, datasets (Hurl et al., 2019; Richter et al., 2017; Gaidon et al., 2016) containing synthetically rendered data from games and virtual engines have been proposed. These datasets ease the problem of collecting data, but may suffer from the synthetic-to-real gap (Zhao et al., 2019) when generalizing to real-world data.

Fig. 2 The network architecture of our method. a overview of our network model; b the encoder architecture; c the attention module in Swin blocks; d the attention module in NeWCRFs blocks; e the architecture of the depth head. Figures are adapted from Liu et al. (2021, 2022) and Yuan et al. (2022)

2.3 Domain Adaptation

Due to the difference in probability distribution between training data and testing data, model performance often deteriorates under domain distribution gaps (Quinonero-Candela et al., 2008). Therefore, it is important to improve the model’s generalization ability.

Domain adaptation aims to maximize the performance of a model on a known target domain by using an existing source domain. Among domain adaptation methods for MDE, Atapour-Abarghouei and Breckon (2018) proposed a two-stage method: the first stage uses synthetic data to train an MDE model, and the second stage trains the model to transfer the style of synthetic data to real-world data. A twin pipeline training framework named \(\mathrm {T^2Net}\) is proposed in Zheng et al. (2018), where a synthetic-to-realistic translation network and a task network for MDE learn jointly. Zhao et al. (2019) trained two symmetric style translation networks and two MDE networks in an end-to-end framework, learning from the ground truth labels in the synthetic domain and epipolar geometry in the real domain.

Furthermore, by broadening the scope of the training data to include more scenarios, for instance by training on large-scale and diverse datasets, it is possible to increase the model’s generalization ability. Existing works (Yin et al., 2021; Xu et al., 2022) train MDE models by mixing high-, medium- and low-quality data in the same proportion in each batch. DeMoN (Ummenhofer et al., 2017), CAM-Convs (Facil et al., 2019) and MiDaS (Ranftl et al., 2020) utilize training data from multiple datasets to train an MDE network, where the mixing training strategy is essential towards an unbiased MDE model.

3 Overview

In this section, we present our solution “Megatron_RVC” for unified and robust MDE. We tackle the challenge from three perspectives, namely network architecture, training strategy and dataset, which are widely recognized as the essential ingredients towards a practical deep learning system.

First, in Sect. 4, we exploit current SOTA vision backbones and present a unified network architecture, which consists of components tailored for robust applications. We adopt a VQVAE module to provide content-aware embeddings and a convex upsampling module to improve the richness of the depth prediction.

Second, we present our novel multi-dataset mix training strategy “Random Iterator Selection” in Sect. 5, which enables us to train a robust single model with different datasets. To resolve the issue of forgetting when iteratively training on multiple datasets, we propose a mix training strategy that supports various resolutions, assigns the proper loss function, and reduces the dataset bias (Torralba & Efros, 2011).

Third, we discuss our diverse dataset, which covers contents as rich as possible to provide a wider range of knowledge, in Sect. 6. We have collected over 1 million samples from multiple datasets to guarantee a training dataset with high diversity.

4 Network Architecture

To achieve robust MDE, the design of the network architecture deserves extra consideration. Modern backbone networks can benefit downstream tasks to a large extent, and the entire network needs a large number of parameters to produce accurate depth predictions when applied to different scenarios.

We adopt the architecture of NeWCRFs (Yuan et al., 2022) as our baseline: the Swin transformer V1 (Liu et al., 2021) is adopted as the backbone, a pyramid pooling module (Zhao et al., 2017) processes features from the bottleneck layer, and the NeWCRFs decoder predicts the depth map. We add a VQVAE module (Van Den Oord & Vinyals, 2017) at the bottleneck of the network to provide content-aware embeddings. Furthermore, a convex upsampling module, similar to that of Teed and Deng (2020), is used to upsample the final depth map to match the size of the input images. The convex upsampling module replaces the 4\(\times \) bilinear upsampling and produces depth predictions with richer details. The network architecture is illustrated in Fig. 2. Below, we introduce each module in detail.
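To make the convex upsampling step concrete, the following is a minimal PyTorch sketch following the RAFT-style formulation referenced above; the assumption is that a small head elsewhere in the network predicts a per-pixel mask of shape (N, 9·4·4, H, W), which is not specified in the description above.

```python
import torch
import torch.nn.functional as F

def convex_upsample(depth, mask, up=4):
    """Upsample a coarse depth map (N,1,H,W) by a factor `up` using a learned
    convex combination of each pixel's 3x3 neighborhood (Teed & Deng, 2020).
    `mask` is assumed to be predicted by the network, shape (N, 9*up*up, H, W)."""
    n, _, h, w = depth.shape
    mask = mask.view(n, 1, 9, up, up, h, w)
    mask = torch.softmax(mask, dim=2)                      # convex weights over the 9 neighbors

    neighbors = F.unfold(depth, kernel_size=3, padding=1)  # (N, 9, H*W): 3x3 neighborhoods
    neighbors = neighbors.view(n, 1, 9, 1, 1, h, w)

    up_depth = torch.sum(mask * neighbors, dim=2)          # (N, 1, up, up, H, W)
    up_depth = up_depth.permute(0, 1, 4, 2, 5, 3)          # (N, 1, H, up, W, up)
    return up_depth.reshape(n, 1, up * h, up * w)
```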

4.1 Encoder Network

We adopt the Swin transformer V1 as our encoder network. As shown in Fig. 2b, the encoder consists of four stages. At the beginning of each stage, the feature map is downsampled through a patch embedding (first stage) or patch merging layer, followed by several Swin transformer blocks. Each transformer block contains two consecutive modules, the Window-based Self-Attention and the Shifted Window-based Self-Attention, as shown in Fig. 3. The architecture of the self-attention module is illustrated in Fig. 2c.

We chose Swin-L as our backbone network, which is initialized using the parameters pre-trained on ImageNet 22K (Deng et al., 2009) at a resolution of 224\(\times \)224.

Fig. 3 The structure of the window-based self-attention and the shifted one

4.2 Decoder Network

The structure of the decoder network highly resembles that of the encoder network. The basic mechanism of the attention operation is shown in Fig. 2d, where the feature from the previous decoder layer serves as the value, while the query and key are computed from the corresponding encoder feature.
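Below is a minimal sketch of this cross-attention pattern; the window partitioning, relative position bias and surrounding MLP of the actual NeWCRFs block are omitted, and the layer names and head count are illustrative.

```python
import torch
import torch.nn as nn

class DecoderCrossAttention(nn.Module):
    """Query/key come from the encoder feature, value from the previous
    decoder feature (cf. Fig. 2d). Windowing is omitted for brevity."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.to_qk = nn.Linear(dim, 2 * dim)
        self.to_v = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, enc_feat, dec_feat):
        # enc_feat, dec_feat: (B, N, C) flattened spatial tokens of equal length
        q, k = self.to_qk(enc_feat).chunk(2, dim=-1)
        v = self.to_v(dec_feat)
        out, _ = self.attn(q, k, v, need_weights=False)
        return self.proj(out)
```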

4.3 The VQVAE Module

To provide content-aware embeddings without explicitly training the network to identify images from different domains, we adopt a VQVAE (Van Den Oord & Vinyals, 2017) module at the bottleneck of the network, which is followed by the PPM head (Zhao et al., 2017) used in Yuan et al. (2022).

The features \(c_i\) in the codebook are initialized randomly. For each feature vector \(z_i'\) in the encoded feature map, the closest code \(c_i'\) is selected to replace it. We train our network to reduce the distance between \(z_i'\) and \(c_i'\), and we also encourage diversity among the codebook features \(c_i\). Details can be found in Sect. 5.6.
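A minimal sketch of this quantization step is given below; the codebook size, feature dimension and the straight-through gradient trick are our assumptions, since the text above only specifies the nearest-neighbor replacement (the associated losses are defined in Sect. 5.6).

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup at the bottleneck. The codebook size,
    feature dimension and straight-through estimator are illustrative choices."""
    def __init__(self, num_codes=512, dim=1536):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))  # randomly initialized c_i

    def forward(self, z):
        # z: (B, N, C) flattened bottleneck features z_i'
        d = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = d.argmin(dim=-1)                 # index of the closest code c_i'
        z_q = self.codebook[idx]               # content-aware embedding that replaces z_i'
        z_q = z + (z_q - z).detach()           # straight-through gradients to the encoder
        return z_q, idx
```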

4.4 Depth Prediction Head

We use SoftPlus as the final activation function to produce positive values representing metric depth estimations. The advantage of adopting SoftPlus over Sigmoid is that the network can produce positive predictions without an upper bound. This is beneficial under the multi-dataset setting, since the datasets can have different maximum depth values.
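For illustration, a possible form of such a prediction head is sketched below; the convolutional layer widths are hypothetical, and only the final SoftPlus activation follows the description above.

```python
import torch.nn as nn

# A minimal sketch of the prediction head (channel widths are assumptions):
# a light conv stack maps decoder features to one channel, and SoftPlus maps
# the logits to strictly positive, unbounded metric depth values.
depth_head = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 1, kernel_size=3, padding=1),
    nn.Softplus(),   # positive depth without a fixed upper bound (unlike Sigmoid * max_depth)
)
```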

5 Training Strategy

Most MDE methods focus on achieving accurate depth predictions on a single dataset, where images and depth maps are usually captured with a single set of devices and the scenes share similar contents. However, when training with data from multiple datasets, the situation is quite different. Given data from multiple domains, images can differ in aspect ratio, imaging style, content and theme, while depth maps can differ in sparsity and scale. In this section, we present our training strategy towards robust MDE.

5.1 The Training Cycle

During our early experiments, we found that training a network on each dataset one after another leads to the forgetting issue (Zhang et al., 2020). A network can achieve the best accuracy on a certain dataset right after training on it, but after switching to another dataset that contains dissimilar scenes, the validation performance on the first dataset decreases dramatically. In other words, the model cannot achieve consistently good performance on all datasets.

To resolve this problem, an intuitive solution is to switch more frequently between datasets, and the ideal situation is to randomly sample images from different datasets within every single batch. However, a prerequisite is that images must have the same size so that they can be formed into a batch. This is feasible if we resize and crop the images and depth maps into patches of the same size and then randomly mix multiple datasets. However, the ill-posedness of the MDE task places higher demands on the resolution of the images, i.e., resizing may introduce ambiguities and hinder the training process.

To analyze this, we train our network on two datasets, KITTI and NYU. After training on one dataset for one epoch, we switch to the other. Evaluations on both datasets are done periodically to monitor the training process. The absolute relative error on the testing dataset is reported in Fig. 4. Our network fits the two datasets at different time steps, and once training switches to the other dataset, the performance drops dramatically. This indicates that the network forgets the learned dataset-specific knowledge very quickly, especially for a heavily data-dependent task like MDE.

Fig. 4 Evaluation error on KITTI and NYU when training iteratively between the two datasets; the switching frequency is every epoch

In order to prevent the network from forgetting too fast, we increase the switching frequency and show the evaluation results on the KITTI dataset in Fig. 5. As the switching frequency increases, the upper bound of the evaluation error decreases as training goes on, which indicates that the network forgets less when being reminded more often.

Fig. 5 Evaluation error of KITTI when training iteratively between two datasets with different switching frequencies, e.g., every 1/8 epoch

5.2 Resolution Overfitting

As discussed in Miangoleh et al. (2021), MDE networks are sensitive to changes in the resolution of the input images. The diversity of image resolutions across multiple datasets also calls for training at the native size, e.g., the images in the KITTI dataset (Geiger et al., 2013) are almost three times as wide as images from the NYU dataset (Silberman et al., 2012).

In Fig. 6, we use several toy examples to illustrate the effect of resizing and cropping. We train the network with different (a) cropping and (b) resizing configurations on KITTI and show the evaluation results periodically. Figure 6a shows that training at the native resolution leads to the highest accuracy within a given number of time steps, but samples randomly cropped without changing the aspect ratio of the image contents allow the model to eventually converge to a similar accuracy. In Fig. 6b, we observe that resizing the images to lower resolutions can severely harm the model accuracy.

We conclude that, for the best efficiency and accuracy, it is preferable to train MDE models on images at their native sizes.

Fig. 6 Absolute relative error evaluated on KITTI periodically when training with different cropping and resizing configurations

5.3 Random Iterator Selection

The discussions in Sect. 5.1 reveal that, to train an MDE network on a mixed dataset, the ideal solution is to include images from each dataset equally in every batch. Section 5.2 reveals that it is better to train on each dataset at its native resolution and aspect ratio. However, this puts us in a dilemma, since images with different resolutions cannot form a batch.

Switching between datasets more frequently is an alternative, and the extreme case is to randomly choose another dataset at every step. When training with multiple GPUs, this randomness can be further extended to each process, which approximates having samples with different sizes in one batch.

Since we train our model with multiple GPUs in parallel, each GPU receives a batch of data of the same size, but samples in different batches do not need to be of the same size. Thus, for each GPU process, we randomly choose a batch of images from one dataset. For maximum efficiency, we assign different batch sizes to different datasets, e.g., if the images are small, we increase the batch size as long as the batch fits into memory.

We name our method Random Iterator Selection. In practice, we randomly select data iterators per GPU process when training with the Distributed Data Parallel mode on multiple GPUs. Using this technique, each GPU is independently assigned a batch of same-sized images from one dataset, and multiple GPUs can have different data allocations, which is equivalent to having inputs of different sizes from randomly chosen datasets across all batches when gradients are accumulated.

The above pipeline also allows us to choose the appropriate loss function to apply, which we introduce in Sect. 5.4. We store the iterators as values in a dictionary, with their names as the keys. We randomly sample one key-value pair at each step, so that we can identify the dataset by the key and fetch data from its iterator. Since each GPU receives samples from a single dataset, the attributes of their depth maps are identical, so each process only needs to choose one loss function and apply it to all samples in its batch.

Since we do not want datasets with huge capacities to dominate the training process, we manually reduce the probability that a large dataset is selected, shown as Sub in Table 1. The intention of our multi-resolution mixing strategy resembles that of the multi-scale sampler in Mehta and Rastegari (2021), but we provide a simpler implementation, which supports sample-rate adjustments and dedicated loss-function allocation.
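The following is a minimal single-process sketch of Random Iterator Selection; the dataset-specific loaders, loss functions and selection weights (the Sub column of Table 1) are assumed to be given, and the loaders are assumed to yield (image, depth) pairs at the native resolution of their dataset. In practice, each DDP process runs this loop independently.

```python
import random
from typing import Callable, Dict

import torch
from torch.utils.data import DataLoader

def train_with_random_iterator_selection(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    loaders: Dict[str, DataLoader],      # one loader per dataset, native resolution
    weights: Dict[str, float],           # reduced selection rates for very large datasets
    loss_fns: Dict[str, Callable],       # dataset-specific loss (Sect. 5.4)
    num_steps: int,
    device: str = "cuda",
):
    """Each DDP process runs this loop independently, so different GPUs draw
    batches of different sizes from different datasets at every step; gradients
    are still averaged across processes by DistributedDataParallel."""
    iterators = {k: iter(v) for k, v in loaders.items()}
    keys = list(loaders)
    probs = [weights[k] for k in keys]
    for _ in range(num_steps):
        key = random.choices(keys, weights=probs)[0]   # pick a dataset iterator
        try:
            image, depth = next(iterators[key])
        except StopIteration:                          # restart an exhausted dataset
            iterators[key] = iter(loaders[key])
            image, depth = next(iterators[key])
        pred = model(image.to(device))                 # batch at the dataset's native size
        loss = loss_fns[key](pred, depth.to(device))   # loss matched to the dataset
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```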

Table 1 Datasets used for network training, which cover both indoor and outdoor scenes
Fig. 7 Overview of the datasets used for training our unified network model. The area of each block corresponds to the equivalent capacity (in log scale) after the selection-probability reduction introduced in Sect. 5.3

5.4 Multi-Loss Function Training

Existing MDE datasets provide depth supervision in different forms (cf. Table 1), thus we have to assign different loss functions to different datasets based on the properties of their depth maps. For datasets with absolute scale (denoted as Metric in Table 1), we use the SILog loss \(\mathcal L_\text {Log}\) as in Yuan et al. (2022),

$$\begin{aligned} \mathcal L_\text {Log} = \sqrt{\frac{1}{K} \sum _i \Delta d_i^2 - \frac{\lambda }{K^2} \left( \sum _i \Delta d_i \right) ^2}. \end{aligned}$$
(1)

For depth prediction \(\{\hat{d}_i\}^K\) and ground truth depth \(\{d_i\}^K\), where K is the total number of pixels in an image, \(\Delta d_i = \log \hat{d}_i - \log d_i\) is the per-pixel log difference, and \(\lambda =0.85\) makes the SILog loss not strictly scale-invariant. For datasets with affine ambiguity (denoted as Affine), we use the affine-invariant loss \(\mathcal L_\text {Affine}\) as in Ranftl et al. (2021),

$$\begin{aligned} \mathcal L_\text {Affine} = \frac{1}{K} \sum _i \rho _i \left| \hat{d}_i^* - d_i^* \right| , \end{aligned}$$
(2)

where

$$\begin{aligned} d_i^*= \frac{d_i-\text {median}(\{d_i\})}{\frac{1}{K} \sum _i \left| d_i-\text {median}(\{d_i\}) \right| }, \end{aligned}$$
(3)

is the normalized ground truth depth, and the depth prediction \(\hat{d}_i^*\) is normalized in the same way; \(\rho _i\) makes the loss function ignore the top 20% of the residuals. For inverse-depth predictions \(\{p_i\}^K\) from DPT (Ranftl et al., 2021) (denoted as Inv-affine), we convert our depth into inverse depth (\(\hat{p}_i= 1/\hat{d}_i\)) and apply \(\mathcal L_\text {InvAffine}\),

$$\begin{aligned} \mathcal L_\text {InvAffine} = \frac{1}{K} \sum _i \left| \hat{p}_i^* - p_i^* \right| , \end{aligned}$$
(4)

where the difference between the two normalized inverse-depth maps is measured.
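A minimal sketch of the first two losses is given below; the valid-pixel mask, the trimming implementation and the small epsilon are our assumptions. \(\mathcal L_\text {InvAffine}\) can be obtained by applying the same normalization to the inverse depths without trimming.

```python
import torch

def silog_loss(pred, gt, mask, lam=0.85):
    """SILog loss of Eq. (1); `mask` selects pixels with valid ground truth."""
    d = torch.log(pred[mask]) - torch.log(gt[mask])
    k = d.numel()
    return torch.sqrt((d ** 2).sum() / k - lam * (d.sum() / k) ** 2)

def _normalize(depth, eps=1e-8):
    """Median / mean-absolute-deviation normalization of Eq. (3)."""
    t = depth.median()
    s = (depth - t).abs().mean()
    return (depth - t) / (s + eps)

def affine_invariant_loss(pred, gt, mask, trim=0.2):
    """Trimmed L1 between the normalized depths, Eq. (2); the largest 20% of
    the residuals are ignored, mirroring the rho_i weighting above."""
    res = (_normalize(pred[mask]) - _normalize(gt[mask])).abs()
    keep = max(1, int((1 - trim) * res.numel()))
    res, _ = torch.sort(res)
    return res[:keep].mean()
```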

Table 2 Performance comparison between our solution and competing methods across different metrics
Fig. 8 Visual comparisons between the participants’ methods. Figures are taken from the official Sintel depth benchmark

5.5 Training Steps

We use 6 RTX 3090 GPUs to train our model in parallel, and the entire training process requires approximately 600 GPU hours.

We first verify the effectiveness of the network architecture by training and evaluating on the KITTI (Geiger et al., 2013) Eigen split (Eigen et al., 2014). During this period, we follow the settings of Yuan et al. (2022) and train the network for 50 epochs; the model with the best SILog metric is kept for the next step. This process takes around 100 GPU hours.

Then, we perform large-scale pretraining, where images are randomly cropped into smaller, fixed sizes. We use all datasets in Table 1 except KITTI and load the parameters from the previous step. The performance is tracked by periodically evaluating on the KITTI Eigen split. The model that achieves the best SILog metric on KITTI is kept for the next step, and this process takes around 500 GPU hours.

Finally, we finetune our model using KITTI, Sintel (Butler et al., 2012) and PreSIL (Hurl et al., 2019), each with its own size and aspect ratio. Specifically, we first split the Sintel dataset into training and testing parts. The mix-training process only uses the training set, and the performance on the testing set is monitored periodically. We notice that the evaluation performance no longer decreases after around 10K steps, so we assume that the network fits all three datasets well. We then use all available samples in the datasets and manually stop training at 10K steps. This process takes less than 20 GPU hours.

5.6 Loss Functions

Apart from the loss terms measured between the predicted and the ground truth depth, we need another two losses to train the VQVAE module. \(\mathcal L_\text {Sim}\) encourages similarities between the encoded features z and the features c in the codebook, and \(\mathcal L_\text {Dis}\) encourages dissimilarities among features in the codebook:

$$\begin{aligned} \begin{aligned} \mathcal L_\text {Sim}&= \sum _i \Vert z_i - c_i \Vert , \\ \mathcal L_\text {Dis}&= \sum _{i,j} \exp {\left( -\Vert c_i - c_j \Vert \right) }. \\ \end{aligned} \end{aligned}$$
(5)

The final loss for each sample is

$$\begin{aligned} \begin{aligned} \mathcal L&= (\mathcal L_\text {Log} \text { or } \mathcal L_\text {Affine} \text { or } \mathcal L_\text {InvAffine}) \\&\quad + \omega _1 \mathcal L_\text {Sim} + \omega _2 \mathcal L_\text {Dis}, \end{aligned} \end{aligned}$$
(6)

where \(\omega _1=0.5\) and \(\omega _2=0.2\). We use \(\mathcal L_\text {Log}\), \(\mathcal L_\text {Affine}\) or \(\mathcal L_\text {InvAffine}\) based on the characteristics of each dataset.
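The two VQVAE terms and the combined loss can be transcribed from Eqs. (5) and (6) as sketched below; the assignment of the nearest code to each feature follows Sect. 4.3, and keeping the constant \(i=j\) terms in \(\mathcal L_\text {Dis}\) is our simplification.

```python
import torch

def vqvae_losses(z, codebook):
    """z: (N, C) encoded bottleneck features; codebook: (M, C) code vectors."""
    dist = torch.cdist(z, codebook)                             # (N, M) pairwise distances
    nearest = codebook[dist.argmin(dim=1)]                      # code c_i assigned to each z_i
    l_sim = (z - nearest).norm(dim=1).sum()                     # Eq. (5): pull features to codes
    l_dis = torch.exp(-torch.cdist(codebook, codebook)).sum()   # Eq. (5): push codes apart
    return l_sim, l_dis                                         # i == j terms are constants

def total_loss(depth_loss, l_sim, l_dis, w1=0.5, w2=0.2):
    """Eq. (6): the dataset-specific depth loss plus the weighted VQVAE terms."""
    return depth_loss + w1 * l_sim + w2 * l_dis
```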

6 Dataset

The task of MDE is heavily data-dependent. To train a unified network model that works well across diverse real-world scenes, we use over 1 million images from different datasets. We use publicly available datasets (Fig. 7) to train our network, whose details are reported in Table 1.

According to the manner in which the depth maps are captured, existing monocular depth datasets can be roughly classified into the following categories: (1) captured using active depth sensors: NYU (Silberman et al., 2012), KITTI (Geiger et al., 2013), DIODE (Vasiljevic et al., 2019), DIML (Kim et al., 2016); (2) computed by stereo matching: HRWSI (Xian et al., 2020), Cityscapes (Cordts et al., 2016), ReDWeb (Xian et al., 2018), DIML (Kim et al., 2016); (3) computed by structure-from-motion: MegaDepth (Li et al., 2018); (4) synthetically rendered: PreSIL (Hurl et al., 2019), GTA (https://github.com/gta5-vision/GTA5-depth-estimation), VKITTI (Cabon et al., 2020), Eden (Le et al., 2021), Sintel (Butler et al., 2012); and (5) predicted by the state-of-the-art monocular depth estimation method DPT (Ranftl et al., 2021) on ImageNet 1K (Deng et al., 2009), MegaDepth (Li et al., 2018) and VIPER (Richter et al., 2017). These datasets cover urban autonomous driving scenarios (Geiger et al., 2013; Cabon et al., 2020; Cordts et al., 2016; Hurl et al., 2019; Richter et al., 2017) (https://github.com/gta5-vision/GTA5-depth-estimation), indoor daily life (Silberman et al., 2012; Vasiljevic et al., 2019; Kim et al., 2016) and synthetic contents (Hurl et al., 2019; Cabon et al., 2020; Le et al., 2021; Butler et al., 2012) (https://github.com/gta5-vision/GTA5-depth-estimation).

Fig. 9 Model predictions on datasets used for training

Fig. 10 Qualitative results on the DIW dataset; our unified MDE model exhibits excellent generalization ability

Our dataset collection contains images with different resolutions, aspect ratios and fields of view. Images in the PreSIL dataset (Hurl et al., 2019) have resolutions up to 1080 \(\times \) 1920, while NYU (Silberman et al., 2012) contains images of a smaller size, 480\(\times \)640. Images in KITTI (Geiger et al., 2013)/VKITTI (Cabon et al., 2020) are extremely wide, with a 3.45:1 aspect ratio, while ImageNet contains vertically shot images. Datasets whose depth maps are actively captured usually contain samples with a fixed focal length, as the same devices are used. Datasets whose depth maps come from stereo matching, multi-view reconstruction or state-of-the-art MDE predictions generally contain images with various fields of view.

Appropriate loss functions should therefore be chosen based on the characteristics of each dataset. We show the corresponding loss functions in Table 1, where each loss function has been introduced in Sect. 5.4.

7 Experimental Results

7.1 The RVC Leaderboard

The Robust Vision Challenge held at ECCV 2022 requires competitors to use a single model and achieve good performance on multiple benchmarks. For MDE, the benchmarks include KITTI (Geiger et al., 2013), Sintel (Butler et al., 2012) and VIPER (Richter et al., 2017).

As of October 3rd, 2022, our model ranked first on all three benchmarks. Detailed metrics are reported in Table 2, and depth predictions on the Sintel dataset are visualized in Fig. 8.

7.2 Wider Applications

Trained on the diverse data shown in Fig. 7, our model learns to predict accurate depth maps for diverse scenes (shown in Fig. 9). Our model not only achieves top metrics on the benchmarks, but can also be applied to daily-life scenarios. We show inference results on unseen diverse scenes from the DIW (Chen et al., 2016) dataset in Fig. 10 to demonstrate the strong generalization ability of our robust MDE model.

8 Ablations and Discussions

In this section, we conduct a series of ablation studies to analyze the contribution of each module of our unified network architecture. Furthermore, we report the inference time. We conclude this section with discussions on limitations and failure cases.

8.1 Backbone Networks

We conducted ablation studies on the KITTI dataset by switching to other backbone networks, including Vision GNN (Han et al., 2022), ConvNext (Liu et al., 2022) and CSwin (Dong et al., 2022); their best performance is reported in Table 3.

We find that modern backbones with large-scale parameters achieve similar performance, but the Swin transformer stands out, as it achieves the highest accuracy on KITTI.

Table 3 Performance comparison of MDE models adopting different backbone networks

8.2 The VQVAE Module

As introduced in Sect. 4.3, we added a VQVAE module at the bottleneck layer of the network. We expect such a module to provide content-related embeddings, which serve as data-specific guidance without explicitly identifying images from different domains.

We conduct an experiment with the same settings as in Fig. 5, but build two models with and without the VQVAE module. Figure 11 shows that after adding the VQVAE module, the network achieves better performance when not being trained on the corresponding dataset. This indicates that the VQVAE module can provide embeddings that are helpful for robust MDE.

Table 4 shows the effectiveness of the VQVAE module when the model is trained with fully-mixed NYU and KITTI. The model with the VQVAE module achieves better performance on both datasets simultaneously than the model without it.

8.3 Depth Regression

Table 5 demonstrates the effectiveness of the convex upsampling module and the activation function. A visual comparison of convex upsampling and bilinear upsampling is shown in Fig. 12. The convex upsampling module helps the network produce depth estimations with sharper boundaries, while the SoftPlus activation function further improves the accuracy and provides more flexibility in terms of the metric depth range.

8.4 Depth Scale Over-Fitting

Figure 5 shows the absolute relative error when training iteratively on two datasets, and the error deviates strongly at different training steps. Since this metric is scale-sensitive, we also report the scale-invariant SILog metric in Fig. 13.

Fig. 11 Evaluation error on the KITTI dataset with and without the VQVAE module

Table 4 The effectiveness of the VQVAE module on the joint training of KITTI and NYU
Table 5 The effectiveness of the depth regression module
Fig. 12 A visual comparison of upsampling methods

Fig. 13 The scale-invariant error on KITTI with different switching frequencies

Compared with the scale-sensitive error, the scale-invariant error shows less severe deviations. This indicates that predicting metric depth is harder than predicting structural relationships, but the forgetting issue still exists, and the scale-invariant metric gradually deteriorates when the model is not trained on the specific dataset.

8.5 Benefits for Zero-Shot MDEs

MiDaS (Ranftl et al., 2020) proposes to measure the error after a per-image scale and shift alignment in the inverse-depth space. Following this, we measure the performance of our model on TUM after applying the optimal affine transformation in the depth space. Likewise, the two parameters are obtained by least-squares fitting.
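A minimal sketch of this per-image alignment (scale and shift fitted by least squares in the depth space) is shown below; the valid-pixel mask is an assumption.

```python
import torch

def align_scale_shift(pred, gt, mask):
    """Fit per-image scale s and shift t by least squares so that
    s * pred + t best matches gt over the valid pixels (depth space)."""
    p, g = pred[mask], gt[mask]
    A = torch.stack([p, torch.ones_like(p)], dim=1)        # (K, 2) design matrix
    sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution   # (2, 1) -> [s, t]
    s, t = sol[0, 0], sol[1, 0]
    return s * pred + t
```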

We report the metrics of our model and NeWCRFs (Yuan et al., 2022) on the Dynamic Objects subset of TUM in Table 6. We also train both networks on the 3D Object Reconstruction category of TUM for 3 epochs, to provide domain information about TUM, and report the evaluation results on the Dynamic Objects category in Table 6 (see the +data rows).

The results in Table 6 indicate that our model possesses excellent generalization ability, which can save training time. When data from a target domain is available, higher performance can be achieved.

Table 6 Evaluation metrics on the TUM Dynamic Object subset
Table 7 The inference speed of our method and DPT given inputs with different sizes

8.6 Inference Speed

Our model contains 372 M parameters, and Table 7 reports the inference speed on a single NVIDIA 3090 GPU for input images with different sizes. The results are averaged over 1000 runs. Our model provides robust MDE results at a reasonable speed, which further proves the practicability of our method. Thanks to the windowed attention mechanism in both the encoder and decoder, our model runs at a nearly identical speed for inputs with different resolutions, which supports faster inference on inputs of larger sizes.
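A rough sketch of how such per-resolution latencies can be measured is given below; the warm-up count is an assumption, and the model handle is assumed to already reside on the GPU.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, size, runs=1000, warmup=20):
    """Average forward-pass latency (seconds) for a given input resolution,
    with explicit synchronization so GPU work is fully timed."""
    x = torch.randn(1, 3, *size, device="cuda")
    for _ in range(warmup):          # warm-up: cuDNN autotuning, memory allocation
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```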

Fig. 14 Failure cases. Our method can be less robust when facing overly complex structures, very blurry images, or close-up photo shooting scenarios, and may make counter-intuitive depth estimations, which are indicated by arrows

8.7 Limitation and Failure Cases

We show failure cases in Fig. 14, where the samples are from the DIW (Chen et al., 2016) dataset. Our method can fail to generate plausible depth predictions for images that contain overly complex or very blurry scenes. Our model can also make mistakes, producing depth predictions with counter-intuitive ordinal relationships.

As mentioned in Sect. 5.4, our model directly produces metric depth estimations. This works fine for evaluation on benchmarks, but may reduce robustness when predicting depth for far-away objects, since some datasets cannot provide valid supervision in the sky region and their maximum depth values vary.

9 Conclusion

In this paper, we presented Megatron_RVC, our winning solution to the monocular depth estimation track of the Robust Vision Challenge 2022. We tackled this challenging task from three different perspectives, namely network architecture, training strategy and dataset. Our network with tailored components, trained on diverse datasets, shows strong performance across all three benchmarks and produces plausible MDE predictions on various scenes. Our proposed mix training strategy, Random Iterator Selection, supports various image resolutions and tailored loss functions. The ingredients of our solution towards robust and unified MDE are not limited to the task of MDE and can also be transferred to other tasks.