1 Introduction

Estimating depth from a single image is a task that humans can accomplish easily, but achieving high precision and low resource requirements with computational models is notoriously difficult. Depth estimation is a fundamental problem in computer vision that is significant for various applications such as scene understanding [1], robot navigation, autonomous driving, augmented reality, scene 3D reconstruction [2], and obstacle detection [3].

Monocular depth estimation (MDE) is the task of obtaining depth information for each pixel from a single RGB image. It is a challenging task because recovering 3D depth information from a 2D image is an inherently ill-posed and ambiguous problem: a single 2D image can be produced by an infinite number of 3D scenes [4]. Furthermore, retrieving depth information without the assistance of additional data, such as stereo images, optical flow, or point clouds, is extremely difficult.

While devices such as depth cameras and LIDAR can directly obtain depth information, they can be quite expensive. An alternative approach is to use binocular images and video sequences to estimate depth [5,6,7,8]. However, stereo matching based on binocular vision requires pixel-by-pixel correspondence and disparity calculation, resulting in higher computational complexity for matching. Moreover, a single pixel may match numerous identical feature points in low-texture scenes, leading to poor matching outcomes. In contrast, monocular depth estimation is relatively less expensive and more easily accessible. With the development of convolutional neural networks (CNNs), monocular depth estimation methods based on CNNs have emerged as an alternative to earlier methods that relied on manually created features [9,10,11].

Previous depth estimation methods have heavily relied on CNN-based techniques, which have significantly improved accuracy. However, CNN-based methods are not always able to make accurate estimates for complex scenes or areas with missing depth information. To address these challenges, researchers have attempted to increase the depth of the model to expand the receptive field of convolution and improve feature extraction capabilities. However, increasing the depth of the model also leads to an increase in the number of parameters, making the model larger and more resource-intensive.

The recent introduction of the Transformer [12] and ViT [13] has opened up a new approach in computer vision. ViT uses self-attention to learn global information in images for various vision tasks, and many ViT-based models have been proposed for monocular depth estimation [14,15,16,17,18,19]. ViT is well suited to extracting global features in vision tasks, but it also makes models larger, slower at inference, and more difficult to train.

Fig. 1

The overall framework of EMTNet. Our architecture consists of two major components: the encoder section for extracting depth features and the decoder section for fusing features at each scale, where the encoder consists mainly of Linear Block (LB) and Mobile Transformer Block (MTB). The diagrams of LB and MTB are shown below the overall architecture diagram, from left to right

In this paper, we propose the Efficient Mobile Transformer Network (EMTNet) for real-time scene depth estimation, shown in Fig. 1. Inspired by MoCoViT [20], we use the mobile transformer block (MTB) to reuse redundant parameters in self-attention calculations, reducing the number of parameters and improving the real-time performance of the model. The EMTNet encoder utilizes both CNN and ViT architectures to extract deep features at local and global scales. Furthermore, we use the DPT [21] decoder to restore resolution and fuse multi-scale depth information to produce high-resolution depth maps.

To validate the performance of our proposed model, we evaluated it on two monocular depth estimation datasets, NYU Depth V2 (indoor, depth range 0-10 m) and KITTI (outdoor, depth range 0-80 m), with corresponding training configurations. Our experiments demonstrate that our depth map output has higher resolution and finer detail than other techniques. Moreover, our method runs in real time at 32 FPS while maintaining high-accuracy depth map output.

The main contributions of this paper are as follows:

  • We propose a new model for real-time monocular depth estimation based on MTB, named EMTNet. MTB uses the Branch Sharing scheme to simplify the computation of the attention map, thus reducing the number of parameters in the model and enabling real-time inference. It achieves a harmonious balance between real-time capability and a minimal parameter count within the same architecture.

  • To enhance the feature extraction capability, we construct the encoder of the model by combining the CNN and ViT architectures. The encoder acquires deep features at two scales, local and global, which greatly improves the model’s ability to capture deep feature information.

  • We conducted experiments on two public datasets, NYU Depth V2 [22] and KITTI [23]. The experimental results show that our method achieves better results than all models with equivalent architectures. Meanwhile, our model outperforms other methods in the quality of its output predictions and retains more obvious detail in complex scenes.

The remainder of this work is structured as follows. Section 2 reviews the literature relevant to this research direction, covering the depth estimation work of the past few years. Section 3 describes the architecture of the proposed network in detail, along with its implementation specifics. Section 4 presents the experiments carried out to validate the efficacy of our approach and discusses their outcomes. Section 5 addresses the limitations of our methodology and outlines potential avenues for future research. Finally, Sect. 6 summarizes the results of this work.

2 Related work

In this section, we introduce the research background in the fields of monocular depth estimation and the Vision Transformer, and summarize the methods used in previous work.

2.1 Monocular depth estimation

Depth estimation from a single color image has been an active area of research in robotics and computer vision for more than a decade. Early methods relied on hand-crafted features and probabilistic graphical models to estimate depth from RGB images captured by monocular cameras. For example, Saxena et al. [24] estimated the absolute scales of different image patches and inferred depth using a Markov random field model. Nonparametric methods [25,26,27,28] have also been used to estimate depth by combining the depth of the image with similar photometric content retrieved from a database. In recent years, depth estimation has shifted toward modern deep learning-based methods [29,30,31], replacing manual feature representations with learned features extracted from neural networks.

The state-of-the-art methods for depth estimation from RGB images involve training convolutional neural networks on large-scale datasets. For example, Eigen et al. were among the first to use deep learning for this task [29]. They proposed a two-stack CNN approach, where one stack predicts a coarse global depth map and the other refines local details to produce more accurate depth maps. Eigen and Fergus [9] further incorporated auxiliary prediction tasks into the architecture. Liu et al. [32] combined a deep CNN with a continuous conditional random field to obtain sharper transitions and local details. Laina et al. [30] developed a deep residual network based on ResNet [33] and achieved even higher accuracy than previous methods. To resolve the ambiguity problem in prediction, Qi et al. [34] trained their network to estimate both depth and normals.

Depth estimation is commonly addressed as a dense regression problem, but recent research has explored treating it as a classification problem: the depth range is divided into multiple bins, and the network predicts which bin each pixel belongs to. Fu et al. [31] pioneered this approach by utilizing ordinal regression to convert depth estimation into a classification problem. Bhat et al. [14] use adaptive bins and a lightweight neural network to estimate depth probability distributions, which are then combined to generate the final depth map. Li et al. [15] built on this approach and incorporated full interaction between the probability distributions and bins, using a Transformer to generate the bins. While predicting depth in discrete bins can simplify training with limited data, it may reduce accuracy compared to predicting continuous values, and the number of bins used can also impact accuracy.

Another promising approach to depth estimation is to use a ViT-based architecture. The ViT [13] is a deep learning model that allows the utilization of global features for a wide range of computer vision tasks. In recent years, many researchers have proposed ViT-based methods for monocular depth estimation. For example, Bhat et al. [14] and Li et al. [15] both incorporated ViT into their method to improve the accuracy of depth estimation. Other studies, such as Zhao et al. [16], Bae et al. [17], Li et al. [18], and Shu et al. [19], have also proposed ViT-based methods that achieve state-of-the-art performance on monocular depth estimation benchmarks. These methods generally leverage the attention mechanism of the ViT to capture global context information and combine it with local features to improve the accuracy of depth estimation.

2.2 Vision transformer

The Vision Transformer is a neural network architecture that has shown promising results in computer vision, leading to the emergence of many ViT-based works. For example, DeiT [35] uses knowledge distillation based on ViT [13] to train a small model that achieves accuracy comparable to that of a larger model. PVT [36] uses a pyramid attention mechanism to handle features at different scales and a cross-layer feature pyramid to improve feature representation. TNT [37] uses a spatial Transformer network and dynamic convolution to improve the deformability and receptive field of the model. CoaT [38] is a multi-layered network structure that uses multi-scale features and multi-layered attention mechanisms to handle features at different levels. Finally, the Swin Transformer [39] combines shifted-window local attention with a hierarchical structure to handle relatively high-resolution images efficiently.

Moreover, a number of lightweight ViT models have been proposed to address real-time applications. For instance, ResViT [40] suggests an improved residual connection method to further reduce the computational burden of the lightweight ViT model. MobileViT [41] introduces a lightweight ViT model for mobile devices, which delivers faster inference speed and smaller model size on mobile devices. LViT [42] achieves good performance by reducing the number of model parameters and computational complexity through the removal of unnecessary modules and downsampling of resolution. Lastly, TinyViT [43] employs grouped convolution and depth-separable convolution to reduce the number of parameters and computational complexity of the model, thus enabling efficient image classification by introducing the transformer module into the conventional neural network.

Recent research [44] has demonstrated that combining convolution and Transformer can enhance prediction accuracy and improve training stability. BoTNet [45] achieved significant advancements in instance segmentation and object detection by replacing the last three bottleneck blocks of ResNet [33] with self-attention. ConViT [46] improved ViT with soft convolutional induction bias by introducing gated position self-attention (GPSA). The CVT [47] combines CNNs with ViT to improve computer vision tasks by introducing localized convolutional operations. LeViT [48] proposes a lightweight ViT model based on the LeNet [49] architecture. In this paper, we adopt the approach of combining CNN networks and Transformer by incorporating the MTB self-attention module into a CNN network, which enhances the model’s feature extraction capabilities and real-time depth estimation performance.

3 Architecture

In this section, we will introduce the overall structure of EMTNet and explain its principles accordingly, which includes encoder and decoder parts. After that, we will introduce the implementation details of the Mobile Transformer Block (MTB), which includes Mobile Self-Attention (MoSA) and Mobile Feed Forward Network (MoFFN) specifically designed for lightweight networks. Finally, we introduce the loss function we used.

3.1 Overview of the network

Although previous CNN-based methods excel at extracting deep feature information from images, they are limited to capturing localized features due to their restricted receptive fields. Expanding the receptive field typically involves stacking multi-layer CNNs or using dilated convolutions, which inevitably increases the number of model parameters or loses feature information. Our proposed model, on the other hand, capitalizes on the strengths of both CNNs and ViTs. By combining these two architectures, our model effectively extracts depth feature information at both local and global scales, substantially enhancing the overall feature extraction capability. To restore depth information for monocular depth estimation tasks, we employ a fusion module with skip connections. This module fuses depth information from multiple scales, helping to preserve feature information at each scale during image restoration. The overall architecture of our proposed network is illustrated in Fig. 1.

Our network follows a standard Encoder-Decoder architecture, comprising an encoder for extracting depth features and a decoder for fusing multi-scale features. The encoder comprises four stages: the first two use Linear Blocks (LB), which extract local features in the scene in a Conv-net style, while the latter two use Mobile Transformer Blocks (MTB), which include two sub-modules, MoSA and MoFFN, designed to extract global information from the scene while reducing computational effort.

EMTNet Encoder. This part is used to extract depth features. It contains a total of four stages, each consisting of \(N_{i}\) identical modules, and we set \(N_1\), \(N_2\), \(N_3\) and \(N_4\) to 4, 4, 12 and 6 respectively. First, input images are processed by a CONV stem with two 3 \(\times \) 3 convolutions with stride 2 as patch embedding, which is used to speed up the subsequent model processing,

$$\begin{aligned} X_1^{B, C_{j \mid j=1}, \frac{H}{4}, \frac{W}{4}} =\textrm{PatchEmbed}\left( X_0^{B, 3, H, W}\right) , \end{aligned}$$
(1)

where \(C_j\) represents the number of channels in the jth stage, and \(X_0\) and \(X_1\) denote the input image and the output of the CONV stem, respectively. \(X_1\) is then fed into the first stage, where we use four LBs for the initial feature extraction. The structure of the LB is shown in Fig. 1; it starts with a pooling layer that extracts low-level features,

$$\begin{aligned}&I_i=\textrm{Pool}\left( X_i^{B, C_j, \frac{H}{2^{j+1}}, \frac{W}{2^{j+1}}}\right) +X_i^{B, C_j, \frac{H}{2^{j+1}}, \frac{W}{2^{j+1}}}, \nonumber \\&X_{i+1}^{B, C_j, \frac{H}{2^{j+1}}, \frac{W}{2^{j+1}}} =\textrm{Conv}_B\left( \textrm{Conv}_{B, G}\left( I_i\right) \right) +I_i, \end{aligned}$$
(2)

where \(\textrm{Pool}\) indicates a pooling operation, \(\textrm{Conv}_{\textrm{B},\textrm{G}}\) denotes a subsequent convolution containing both BN and GELU operations, and \(\textrm{Conv}_{\textrm{B}}\) denotes a subsequent convolution containing only BN operations.
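For concreteness, the following PyTorch sketch illustrates one possible reading of the CONV stem (Eq. 1) and the Linear Block (Eq. 2). The module names, channel widths, kernel sizes, and the choice of average pooling are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """CONV stem: two 3x3 stride-2 convolutions, giving an H/4 x W/4 output (Eq. 1)."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch // 2), nn.GELU(),
            nn.Conv2d(out_ch // 2, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )

    def forward(self, x):                 # (B, 3, H, W) -> (B, C1, H/4, W/4)
        return self.stem(x)

class LinearBlock(nn.Module):
    """LB: pooling branch plus two convolutions, each with a residual (Eq. 2)."""
    def __init__(self, ch):
        super().__init__()
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)       # assumed pooling op
        self.conv_bg = nn.Sequential(                           # Conv_{B,G}
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.GELU())
        self.conv_b = nn.Sequential(                             # Conv_B
            nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch))

    def forward(self, x):
        i = self.pool(x) + x                                     # I_i = Pool(X_i) + X_i
        return self.conv_b(self.conv_bg(i)) + i                  # X_{i+1}
```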

The second stage is similar to the first and also uses four LBs; however, after passing through the embedding layer, the feature map is halved in size while the number of channels is doubled.

In the third and fourth stages, we employ 12 and 6 MTBs respectively to extract global information from the features. The MTBs are constructed by modifying traditional self-attention, and consist of two sub-modules: MoSA and MoFFN.

$$\begin{aligned}&I_i=\textrm{MoSA}\left( X_i^{B, \frac{H W}{4^{j+1}}, C_j}\right) +X_i^{B, \frac{H W}{4^{j+1}}, C_j}, \nonumber \\&X_{i+1}^{B, \frac{H W}{4^{j+1}}, C_j}=\textrm{MoFFN}\left( I_i\right) +I_i, \end{aligned}$$
(3)

where i indexes the MTB within the stage, \(X_i\) denotes its input tokens, and j denotes the stage in which the module operates.

Fig. 2

Overview of the Fusion module. The Fusion module consists mainly of a CNN architecture that fuses features at different scales and upsamples the output to the next module

We decided to place LB before MTB in our network architecture based on the intuition that LB is better suited for extracting local feature information for constructing edge contours, while MTB is better suited for extracting global features to estimate continuous large areas. In monocular depth estimation, the contour information of objects is particularly important as the most distinct depth variation is often found at the edges of the object, while the depth variation is smoother or more consistent inside the object contour. Although CNN-based models do expand the receptive field of convolution when dealing with higher dimensional features and to some extent use global information, they do not perform as well as MTB in extracting global information. Therefore, placing LB before MTB was based on our consideration of feature extraction. In the following sections, we will describe the overall structure of the EMTNet in more detail.

EMTNet Decoder. We designed the Fusion module to fuse intermediate features from the previous Fusion module and the corresponding encoder stage. The use of skip connections in the Fusion modules preserves feature information at multiple scales, preventing information loss during image restoration. After passing through the four Fusion modules, the final output feature map is 1/4 the size of the original image. To produce the final depth prediction, we add an output head dedicated to depth estimation. The detailed structure of the decoder is shown in Fig. 2.

The Fusion module starts with a convolutional layer that adjusts the dimensionality of the feature map; in our implementation we keep the dimensionality the same before and after this convolution. We also plan to incorporate depthwise separable convolution (DSC) in subsequent ablation experiments to test whether it can further reduce the parameter count. The output of the convolutional layer then passes through a Residual Conv Unit and is added to the output of the previous module. The result is passed through another Residual Conv Unit, followed by upsampling and a linear projection for the final output to the next module. To ensure better performance, we use the GELU activation function instead of ReLU in the Fusion module, as verified in the ablation experiments in Sect. 4.4. In the following section, we delve into the specifics of the MTB.
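Before doing so, we give a minimal PyTorch sketch of the Fusion module described above. The layout of the Residual Conv Unit, the kernel sizes, and the channel-preserving projection are assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualConvUnit(nn.Module):
    """Two 3x3 convolutions with GELU activations and a residual connection (assumed layout)."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        out = self.conv1(self.act(x))
        out = self.conv2(self.act(out))
        return out + x

class Fusion(nn.Module):
    """Fusion module: project encoder features, add the skip from the previous
    Fusion output, refine, then upsample 2x for the next module."""
    def __init__(self, ch):
        super().__init__()
        self.project = nn.Conv2d(ch, ch, 3, padding=1)   # keeps the channel count unchanged
        self.rcu1 = ResidualConvUnit(ch)
        self.rcu2 = ResidualConvUnit(ch)
        self.out_proj = nn.Conv2d(ch, ch, 1)             # linear projection to the next module

    def forward(self, enc_feat, prev=None):
        x = self.rcu1(self.project(enc_feat))
        if prev is not None:                              # skip connection from the previous Fusion
            x = x + prev
        x = self.rcu2(x)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.out_proj(x)
```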

3.2 Mobile transformer block

Although CNN-based methods can extract depth features, they are limited in their ability to extract global feature information due to their inherent characteristics. The Transformer [12] and ViT [13] were proposed to address this limitation by enabling models to extract features by combining global information from the image, thus improving generalization ability and accuracy. However, these approaches using global attention have a significantly larger number of parameters compared to CNN-based models, making them less suitable for real-time applications. In our proposed method, we enhance traditional self-attention by introducing MTB to reduce the number of parameters and FLOPs of the model, enabling real-time monocular depth estimation with improved accuracy. Self-attention used in the traditional ViT is,

$$\begin{aligned} \textrm{Self-Attention}(Q,K,V) = \textrm{Softmax} \left( \frac{QK^T}{\sqrt{d_k}}\right) V, \end{aligned}$$
(4)
Fig. 3

The Mobile Transformer Block. Mobile Transformer Block consists of Mobile Self-Attention (MoSA) and Mobile Feed Forward Network (MoFFN). The branch sharing mechanism in MoSA avoids computing Q and K, and computes the attention map by reusing V. Ghost module is used to replace Linear layer, and LayerNorm is removed for efficiency

where Q, K and V are three matrices learned by the model. The operations involving the Q and K matrices account for the majority of the self-attention computation, so this is where the MTB module offers the greatest room for improvement.

While self-attention is a powerful mechanism for capturing global dependencies in an image, it becomes less advantageous than convolutional layers in lightweight models with constrained capacity due to its quadratic computational complexity with respect to spatial resolution. To compute a linear combination of results for V, traditional self-attention requires three parallel linear layers of the same size. When dealing with multi-head self-attention, the superposition of multiple self-attention operations significantly increases the number of parameters. To address this problem, we introduce MoSA, which replaces traditional self-attention with an attention mechanism specifically designed for lightweight Transformer structures.

Mobile Self-Attention (MoSA). MoSA uses a branch sharing scheme to reuse weights in the Q, K and V calculations, making it an attention mechanism designed for lightweight Transformer structures. As shown in Fig. 3, \(F^q\), \(F^k\) and \(F^v\) are projections of Q, K and V with the same input features, respectively. The approach reuses the features of V directly for Q and K, based on the intuition that Q and K are only involved in computing the attention map, while the final result of the self-attention mechanism is a linear combination of the tokens in V. Thus, V must retain more semantic information than Q and K to guarantee the final weights and the representational power of the result. Consequently, the output of self-attention is more strongly correlated with V than with Q and K. Reusing V simplifies the computation of Q and K for real-time tasks and achieves a better balance between performance and overhead. Compared to traditional self-attention, MoSA replaces the Q and K matrices with the V matrix, resulting in fewer parameters and faster computation.

$$\begin{aligned} F^v&= F^k = F^q \nonumber \\ V&= F^v(X) \nonumber \\ K&= F^k(X) = V^T \nonumber \\ Q&= F^q(X) = V \end{aligned}$$
(5)

where \(F^v\), \(F^q\) and \(F^k\) are the projections used to compute V, Q and K, respectively. To avoid feature loss, a depthwise separable convolution branch is added to the output of V. The improved self-attention is,

$$\begin{aligned} \textrm{MoSA}(V) = \textrm{Softmax}\left( \frac{VV^T}{\sqrt{d_k}}\right) V + \textrm{Depthwise}(V), \end{aligned}$$
(6)
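A minimal PyTorch sketch of MoSA under our reading of Eqs. (5)-(6) is given below; the single shared projection, the token-to-feature-map reshape, and the depthwise kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoSA(nn.Module):
    """Mobile Self-Attention sketch: branch sharing reuses V for Q and K (Eqs. 5-6)."""
    def __init__(self, dim):
        super().__init__()
        self.to_v = nn.Linear(dim, dim)   # single projection F^v, shared with F^q and F^k
        self.scale = dim ** -0.5
        # depthwise convolution branch on V to avoid feature loss
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens with N = h * w
        b, n, c = x.shape
        v = self.to_v(x)                                   # V = F^v(X), reused as Q and K
        attn = (v @ v.transpose(-2, -1)) * self.scale      # V V^T / sqrt(d_k)
        out = attn.softmax(dim=-1) @ v                     # Softmax(.) V
        v_sp = v.transpose(1, 2).reshape(b, c, h, w)       # spatial layout for the DW branch
        out = out + self.dw(v_sp).reshape(b, c, n).transpose(1, 2)
        return out
```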

Mobile Feed Forward Network (MoFFN). MoFFN is a fine-grained feature operation that replaces the linear layers of the traditional feed-forward network with the more efficient Ghost [50] module. To extract features along the channel dimension, MoFFN contains two Ghost modules with a Squeeze-and-Excitation Network (SENet) [51] inserted between them. In image processing tasks, channel-domain attention explicitly models the interdependencies between feature channels. Following the suggestion of Hu et al. [51], we place the SENet inside the residual structure, and the module ends with a residual connection.

Figure 3 illustrates the MoFFN module and the Ghost structure, a widely used technique in lightweight networks for constructing features in a cost-effective manner. The Ghost module uses standard convolution to generate a few intrinsic feature maps, which are then expanded to a larger number of channels using cheap linear operations. To achieve a better balance between performance and speed, these linear operations are typically implemented as depthwise convolutions. MoFFN can be expressed as follows,

$$\begin{aligned}&y=\textrm{Ghost}(\textrm{SE}(\textrm{Ghost}(x)))+x \nonumber \\&\textrm{Ghost}(x)=\textrm{Concat}\left[ \textrm{DWConv}_{B, G} \left( \textrm{Conv}_{B, G}(x)\right) , \textrm{Conv}_{B, G}(x)\right] \end{aligned}$$
(7)

where SE denotes the channel attention module, DWConv\(_{B,G}\) denotes the subsequent depthwise separable convolution containing BN and GELU operations, and Conv\(_{B,G}\) denotes the subsequent ordinary convolution containing BN and GELU operations.
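The following sketch shows one way to realize Eq. (7) in PyTorch. The primary/cheap channel split inside the Ghost module and the SE reduction ratio are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class Ghost(nn.Module):
    """Ghost module (Eq. 7): a standard conv followed by a cheap depthwise expansion."""
    def __init__(self, ch):
        super().__init__()
        self.primary = nn.Sequential(                      # Conv_{B,G}
            nn.Conv2d(ch, ch // 2, 1), nn.BatchNorm2d(ch // 2), nn.GELU())
        self.cheap = nn.Sequential(                        # DWConv_{B,G}
            nn.Conv2d(ch // 2, ch // 2, 3, padding=1, groups=ch // 2),
            nn.BatchNorm2d(ch // 2), nn.GELU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([self.cheap(y), y], dim=1)        # Concat[DWConv(Conv(x)), Conv(x)]

class SE(nn.Module):
    """Squeeze-and-Excitation channel attention."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class MoFFN(nn.Module):
    """MoFFN: Ghost -> SE -> Ghost with a residual connection (Eq. 7)."""
    def __init__(self, ch):
        super().__init__()
        self.g1, self.se, self.g2 = Ghost(ch), SE(ch), Ghost(ch)

    def forward(self, x):                                  # x: (B, C, H, W)
        return self.g2(self.se(self.g1(x))) + x
```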

The Ghost module is a widely acknowledged structure in lightweight networks, and its effectiveness has been extensively demonstrated. The MoFFN, consisting of the Ghost module and SENet [51], serves as an efficient replacement for the traditional Feed-Forward Network (FFN) in ViT. The MoFFN module proves highly effective in addressing real-time tasks.

MoSA and MoFFN together constitute the MTB. In this work, we leverage MTB to reduce the computational load of the model, enhancing its efficiency while preserving accuracy. The primary objective of adopting MTB is to streamline attention computation and improve the real-time speed of the model. Notably, MTB computes faster than the traditional self-attention module. In the experimental section, we compare the processing speed of models with different architectures and analyze their respective performance.
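For concreteness, a minimal sketch of how MoSA and MoFFN compose into an MTB (Eq. 3) is shown below. It reuses the MoSA and MoFFN sketches above; the token-to-feature-map reshape before MoFFN is our assumption.

```python
import torch.nn as nn

class MTB(nn.Module):
    """Mobile Transformer Block: MoSA followed by MoFFN, each with a residual (Eq. 3)."""
    def __init__(self, dim):
        super().__init__()
        self.mosa = MoSA(dim)
        self.moffn = MoFFN(dim)

    def forward(self, x, h, w):
        x = self.mosa(x, h, w) + x                       # I_i = MoSA(X_i) + X_i
        b, n, c = x.shape
        x_sp = x.transpose(1, 2).reshape(b, c, h, w)     # MoFFN operates on the spatial layout
        x_sp = self.moffn(x_sp)                          # internal residual realizes "+ I_i"
        return x_sp.reshape(b, c, n).transpose(1, 2)
```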

Loss function: Inspired by [52], we use the Scale-Invariant (SI) loss proposed by Eigen et al. [29] as our loss function,

$$\begin{aligned} \ell _{pixel} = \alpha \sqrt{\frac{1}{T} \sum _{i}^{}g_i^2 - \frac{\lambda }{T^2} \left( \sum _{i}^{} g_i \right) ^2} \end{aligned}$$
(8)

where \(g_i=\log \tilde{d}_i-\log d_i\), \(d_i\) is the ground truth depth, \(\tilde{d}_i\) is the estimated depth and T denotes the number of pixels having valid ground truth values. We use \(\lambda \) = 0.85 and \(\alpha =10\) for all our experiments.
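The loss in Eq. (8) translates directly into a few lines of PyTorch; the sketch below assumes the prediction and ground truth are positive tensors of the same shape with a boolean mask of valid pixels.

```python
import torch

def si_loss(pred, target, valid_mask, lam=0.85, alpha=10.0):
    """Scale-invariant loss (Eq. 8) over pixels with valid ground truth."""
    g = torch.log(pred[valid_mask]) - torch.log(target[valid_mask])   # g_i
    t = g.numel()                                                      # T: number of valid pixels
    return alpha * torch.sqrt((g ** 2).sum() / t - lam * (g.sum() ** 2) / t ** 2)
```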

4 Experiments

We conducted extensive experiments on standard single-image depth estimation datasets covering both indoor and outdoor scenes. In the following, the first section gives a brief description of the individual datasets and the evaluation metrics. The second section describes the implementation details of the experiments. In the third part, we compare the model quantitatively with previous monocular depth estimation methods and perform generalizability tests. In the fourth section, we conduct ablation experiments to validate the effectiveness of our network. In the last section, we summarize and analyze all the experimental results and discuss their implications.

4.1 Datasets and evaluation metrics

We tested the model on three datasets. This section presents the datasets and the treatment of the data; the evaluation metrics used are presented at the end.

NYU Depth V2 is a dataset that provides images and depth maps of various indoor scenes captured at a pixel resolution of 640 \(\times \) 480 [22]. The dataset comprises 1,449 densely labeled images and 407,024 pseudo-labeled and unlabeled images. We used a subset of 24,231/654/654 images for training, validation, and testing. During the training period, we preprocessed the original data through random cropping and rotation, using a crop size of 416\(\times \)544. We used the original image size of 480\(\times \)640 during the testing period.

KITTI is a dataset that provides stereo images and corresponding 3D laser scans of outdoor scenes captured using equipment mounted on a moving vehicle [23]. The KITTI dataset contains real image data collected from urban, rural, and highway scenes, with a sampling resolution of 375\(\times \)1,242, sampled and synchronized at 10Hz. We selected the depth prediction dataset as our model test data, using a subset of 23,158/697/697 images for training, validation, and testing, respectively. During the training period, we performed image enhancement on the original data through random cropping and rotation in the image preprocessing step. We used a crop size of 320\(\times \)1,056 for the random cropping operation. We did not use the same preprocessing operation for validation, and we used the original size of 375\(\times \)1,242 for both validation and testing.

SUN RGB-D is a publicly available dataset on scene understanding from the Vision & Robotics Group at Princeton University. SUN RGB-D is captured by four different sensors and contains 10,000 RGB-D images, at a similar scale as PASCAL VOC [53]. The whole dataset is densely annotated and includes 146,617 2D polygons and 58,657 3D bounding boxes with accurate object orientations, as well as a 3D room layout and category for scenes. This dataset enables us to train data-hungry algorithms for scene-understanding tasks, evaluate them using direct and meaningful 3D metrics, avoid overfitting to a small testing set, and study cross-sensor bias. We use this dataset as a benchmark of model generalization ability for determining the generalization results of different models trained on NYU Depth V2.

Evaluation metrics. To evaluate the accuracy of the depth estimation results, we used the standard metrics for depth estimation proposed by Eigen et al. [29]. These metrics are defined as:

  • threshold accuracy (\(\delta _i\)): % of \(y_p\) s.t. \(\max (\frac{y_p}{\hat{y}_p},\frac{\hat{y}_p}{y_p}) = \delta < thr\) for \(thr= 1.25, 1.25^{2}, 1.25^{3}\);

  • absolute relative error (AbsRel): \(\frac{1}{n} {\sum _{p}^{n}} \frac{|y_{p}-\hat{y}_{p} |}{y_p}\);

  • squared relative error (SqRel): \(\frac{1}{n} {\sum _{p}^{n}} \frac{{\Vert y_p-\hat{y}_p \Vert }^2}{y_p}\);

  • root mean squared error (RMSE): \(\sqrt{\frac{1}{n} {\sum _{p}^{n}} (y_p-\hat{y}_p)^2}\);

  • root mean squared log error (RMSE\(_{log})\): \(\sqrt{\frac{1}{n} {\sum _{p}^{n}} {\left\| log(y_p)-log(\hat{y}_p)\right\| }^2}\);

where \(y_p\) is a pixel in depth image y, \(\hat{y}_p\) is a pixel in the predicted depth image \(\hat{y}\), and n is the total number of pixels for each depth image.
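These metrics can be computed directly from the definitions above; the sketch below assumes the ground truth and prediction have already been flattened to arrays of valid pixels.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth-estimation metrics over valid pixels (gt, pred: positive 1-D arrays)."""
    ratio = np.maximum(gt / pred, pred / gt)
    d1 = (ratio < 1.25).mean()
    d2 = (ratio < 1.25 ** 2).mean()
    d3 = (ratio < 1.25 ** 3).mean()
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return dict(d1=d1, d2=d2, d3=d3, AbsRel=abs_rel, SqRel=sq_rel,
                RMSE=rmse, RMSElog=rmse_log)
```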

4.2 Implementation details

We implemented the proposed method using PyTorch version 1.12 on Ubuntu 20.04, and trained it on a single NVIDIA GeForce RTX 2080 Ti graphics card. Prior to inputting the original image into the model, we applied standard data augmentation and image enhancement techniques to the image. The specific methods are as follows:

  • Cropping: both the input image and the target image were randomly cropped. For the NYU Depth V2 dataset, the image was cropped to a size of 416 \( \times \) 544, and for the KITTI dataset, the image was cropped to a size of 320 \( \times \) 1056.

  • Rotation: we randomly rotated the input image and target image between the angles \( r \in [-2.5,2.5] \).

  • Gamma enhancement: we applied gamma enhancement to the original image with \(\gamma \) powers. The value of \(\gamma \) was randomly selected from the range \(\gamma \in [0.9, 1.1]\).

  • Brightness enhancement: we multiplied the original image with d to create a random variation in brightness. For the NYU Depth V2 dataset, the value of d was randomly selected from the range \(d \in [0.75,1.25] \), and for the KITTI dataset, the value of d was randomly selected from the range \(d \in [0.9,1.1] \).

  • Color enhancement: we multiplied the original image by a random RGB value \( c \in [0.9,1.1] \).

  • Horizontal flip: we randomly flipped the image and target image horizontally with a probability of 0.5.
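The following sketch assembles the augmentations listed above into one paired transform for NYU Depth V2 (the crop size and brightness range differ for KITTI). The function name, the tensor layout, and the assumption that inputs are float tensors in [0, 1] larger than the crop size are ours, not the original implementation.

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(img, depth, crop_hw=(416, 544), bright=(0.75, 1.25)):
    """Paired augmentation: img (3, H, W) and depth (1, H, W), both float tensors."""
    # random rotation in [-2.5, 2.5] degrees, applied to image and target
    angle = random.uniform(-2.5, 2.5)
    img, depth = TF.rotate(img, angle), TF.rotate(depth, angle)
    # random crop to the training resolution (input assumed larger than the crop)
    h, w = img.shape[-2:]
    ch, cw = crop_hw
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    img = img[..., top:top + ch, left:left + cw]
    depth = depth[..., top:top + ch, left:left + cw]
    # photometric changes applied to the image only
    img = img ** random.uniform(0.9, 1.1)                   # gamma enhancement
    img = img * random.uniform(*bright)                     # brightness enhancement
    img = img * torch.empty(3, 1, 1).uniform_(0.9, 1.1)     # per-channel color enhancement
    img = img.clamp(0.0, 1.0)
    # horizontal flip with probability 0.5
    if random.random() < 0.5:
        img, depth = TF.hflip(img), TF.hflip(depth)
    return img, depth
```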

In addition to the above operations, several training techniques were used to accelerate the convergence of the model. The training method was set as follows.

We employed the AdamW optimization algorithm with weight decay 0.1 to update the parameters of our models during the training period. The maximum learning rate for the NYU Depth V2 and KITTI datasets was set to \(5 \times 10^{-4}\) and \(3.75 \times 10^{-4}\), respectively. To accelerate the convergence of the model, we also employed learning rate warm-up [33] and OneCycleLR policy [54]. Specifically, we set the learning rate with max_lr = \(3.5 \times 10^{-4}\) and warm-up the learning rate from max_lr/25 to max_lr for the first 30% of iterations, followed by cosine annealing to max_lr/100. The number of training epochs was set to 200, with a batch size of 10, until the model finally converged and stopped. Training our model took approximately 25 min per epoch on a single node with one NVIDIA GeForce RTX 2080 Ti graphics card. Finally, we validated each dataset after training and tested the test set with the best model on the validation set.
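As a sketch of this training setup, the helper below builds AdamW together with a OneCycleLR schedule matching the description above (30% warm-up from max_lr/25, then cosine annealing to max_lr/100); the function name and default arguments are assumptions for illustration.

```python
import torch

def build_optimizer(model, steps_per_epoch, num_epochs=200, max_lr=3.5e-4):
    """AdamW + OneCycleLR: warm up over the first 30% of iterations, then cosine-anneal."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=max_lr,
        total_steps=num_epochs * steps_per_epoch,
        pct_start=0.3,            # 30% warm-up
        div_factor=25,            # initial lr = max_lr / 25
        final_div_factor=4,       # final lr = (max_lr / 25) / 4 = max_lr / 100
        anneal_strategy="cos",
    )
    return optimizer, scheduler   # call scheduler.step() once per training iteration
```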

4.3 Comparison with the most advanced available

We assess the efficacy of our proposed network through evaluations on two datasets: NYU Depth V2 and KITTI. To ensure a comprehensive appraisal, we select methods previously applied to real-time monocular depth estimation as benchmarks for comparison, providing insight into the performance of our architecture. Furthermore, we conduct generalizability tests on the SUN RGB-D dataset, along with ablation experiments aimed at validating the model’s robustness.

Table 1 Comparison of performances on the NYU Depth V2. The reported numbers are from the corresponding original papers. Measurements are made for the depth range from 0 m to 10 m. The best results are in bold, second best are underlined

Results on NYU Depth V2. Table 1 presents a comprehensive performance comparison on the official NYU Depth V2 test set. Our model outperforms previous methods on most metrics. However, it does not exhibit significant improvement on certain metrics, such as RMSE and RMSElog, and shows only marginal enhancements on others. We speculate that one reason our method does not show a stronger advantage on this dataset is that the MTB module leans toward speed at some cost in accuracy, a trade-off that is unavoidable. Nevertheless, Fig. 4 visually illustrates the depth prediction results of our model alongside the comparison models, with white dashed boxes emphasizing subtle differences in the same scene. Despite not dominating quantitatively, our method excels in the quality of the generated depth maps, particularly in capturing finer detail within complex scenes. Moreover, our model exhibits remarkable depth completion ability in regions with missing depth compared to other models. In contrast, outputs from comparison models such as An et al. [61] and Wofk et al. [56] exhibit erroneous estimates in depth-missing regions, rendering them nearly indistinguishable from the surrounding scene. These findings highlight the strength of our model in producing accurate and detailed depth maps, especially in challenging scenarios.

Results on KITTI. Table 2 provides an overview of the performance metrics for all related models on the KITTI test set. Our model emerges as the clear leader, showcasing significantly superior results across all evaluated metrics. Particularly noteworthy is its performance compared to methods such as MonoFormer, Lite-Mono, and Varma et al., which also utilize the Transformer architecture; quantitatively, our approach clearly outperforms these Transformer-based methods. Figure 5 illustrates the depth prediction results for our model alongside the comparison methods. Focusing on the areas highlighted by white dashed boxes in the figure, our model excels in capturing intricate details and exhibits exceptional depth prediction accuracy, especially in regions with missing depth information.

Table 2 Comparison of performances on the KITTI. The reported numbers are from the corresponding original papers. Measurements are made for the depth range from 0 m to 80 m. The best results are in bold, second best are underlined

Results on SUN RGB-D. Table 3 presents the results of the generalization tests on SUN RGB-D. For the assessment of generalization ability, we carefully selected a diverse set of methods with different architectures for comparison. All models were pre-trained on NYU Depth V2 without fine-tuning their parameters. Among the models tested, Adabins [68] represents the depth estimation model using a pure Transformer architecture. Upon analyzing the results, we observed that our model exhibits a slightly lower overall generalization ability compared to the Transformer architecture approach. However, when compared to hybrid architectures like MonoFormer [65] and Lite-Mono[66], our method demonstrates better generalization ability. Moreover, in the comparison with methods utilizing CNN architecture, our model emerges as the more advantageous choice in terms of generalization performance.

Table 3 Comparison of performances on the SUN RGB-D test set without fine-tuning the models trained on NYU Depth V2. The best results are in bold and second best are underlined. The range of ground truth depth for evaluation is from 0 m to 8 m

We conducted a comprehensive comparison of different models by evaluating both Params and real-time performance (FPS) on the same device, ensuring a fair assessment under identical conditions. Table 4 presents the test results, with AbsRel measured on the KITTI dataset. Theoretical analysis suggests that the Transformer architecture typically exhibits higher computational complexity than CNN-based models due to the inclusion of the attention module. The test results show that our model showcases a significant advantage, boasting fewer parameters compared to a model employing the Transformer architecture. Furthermore, when pitted against hybrid architectures such as Lite-Mono [66] and MonoFormer (Hybrid) [65], our model outperforms in terms of accuracy. In regard to real-time performance, our model is slower than pure CNN-based models but faster and more accurate than pure Transformer-based models. This trade-off allows our approach to strike an optimal balance concerning the number of parameters and computational efficiency within the hybrid architecture models. As a result, our model achieves remarkable results in terms of both accuracy and computational performance.

Table 4 Quantitative comparison of different architecture models.
Fig. 4

Qualitative comparison with An et al. [61], Nekrasov et al. [55], Wofk et al. [56], Ma et al. [59]. All the models are pre-trained on NYU Depth V2 [22] training set

Fig. 5

Qualitative comparison on KITTI Eigen split [23]. For each column, from top to bottom we present the input image, the prediction from An et al. [61], Nekrasov et al. [55], Wofk et al. [56], Ma et al. [59] and EMTNet (ours)

4.4 Ablation study

In the ablation study, we evaluated the impact that the following different design choices had on our model.

Depthwise separable convolution (DSC). DSC originates from MobileNet [69], a lightweight model for computer vision tasks that is often used in environments with limited hardware resources. Intuitively, its use allows our model to be more adaptable to real-time tasks. Therefore, during the experimental stage, we used DSC in the skip-connection part between the encoder and the decoder. However, the experiments showed that including DSC reduced the number of parameters in the model, but the model’s accuracy decreased, which was not desirable.

Activation function. In our experiments, we used both the ReLU and GELU activation functions to verify the real-time performance of the model. ReLU is a simple piecewise-linear function, while GELU is a smooth nonlinear function. In theory, the cheaper ReLU allows for faster processing in environments with limited hardware resources. However, our experiments showed that using GELU was superior, which was surprising.

MTB Block. MTB is the module we introduced to reduce the number of model parameters and FLOPs. In the ablation experiments, we compared the outcomes of using MTB and traditional self-attention and found that using MTB led to better performance. Although it is not as good in real-time as using DSC, it showed better accuracy performance (Table 5).

Table 5 Ablation results of the DSC, ReLU, GeLU and MTB.

4.5 Experimental discussion

We propose a real-time monocular depth estimation model named EMTNet, which is built upon the Mobile Transformer Block (MTB). EMTNet effectively integrates CNN and ViT, enabling the extraction of both local and global features in complex scenes. This synergy accounts for the network’s ability to enhance depth map details and to generalize robustly. Furthermore, the Branch Sharing scheme employed by MTB efficiently reduces the model’s parameter count, endowing it with the capability for real-time depth estimation. EMTNet outperforms the other models in the level of detail of its output depth maps. However, using MTB comes with a trade-off between accuracy and real-time performance: on the NYU Depth V2 dataset the model’s accuracy is not significantly improved over previous work, and its real-time efficiency still falls short of pure CNN-based models. In the generalization test on the SUN RGB-D dataset, our network performs well among models with the same hybrid architecture, but there is still considerable room for improvement compared to the network using the pure Transformer architecture.

Our aim is to enhance both the model’s accuracy and its real-time performance, but using a global attention approach can cause a decline in real-time performance. Striking a balance between the two is therefore challenging for depth estimation models. Furthermore, high-resolution depth maps are not always necessary for most depth estimation tasks, as depth is continuous over most regions; high-resolution outputs mainly pay off in scenes with complex depth structure, yet they also reduce the model’s real-time performance. Thus far, we have achieved our desired results in terms of both model accuracy and real-time performance.

5 Limitations and future directions

Our method has demonstrated promising results in experiments conducted on two datasets, surpassing CNN-based models and even some Transformer-based methods. However, we acknowledge that our model still has certain limitations due to architectural design deficiencies and model training issues. One major concern is that our model retains a substantial number of parameters and relatively high computational complexity compared to other hybrid (CNN+Transformer) models, and redundant computations also pose a challenge. Moreover, we observed variations in prediction accuracy across datasets, with the model performing less effectively on NYU Depth V2 than on KITTI: some metrics showed only marginal improvement, and in some cases even a reduction. We suspect that a lack of coordination between the CNN and Transformer components during depth feature extraction in the encoder causes important features to be lost during transmission. Additionally, our model faced convergence issues during training, requiring a large number of epochs to converge slowly.

To address these limitations and enhance our method, we plan to explore alternative advanced modeling approaches. This exploration could involve a pure Transformer architecture with pre-trained parameter initialization, along with an investigation of a depth-interval categorization (bins-based) methodology to speed up the model’s convergence. We also plan to incorporate a broader array of data augmentation techniques into the training regimen, which should further improve the model’s generalization ability. Moreover, we aspire to examine the model’s adaptability and the potential application of its enhanced methodologies to a wide range of visual tasks, including but not limited to semantic segmentation, target detection, and multi-image 3D reconstruction.

6 Conclusion

We present EMTNet, an innovative real-time monocular depth estimation model constructed upon the Mobile Transformer Block (MTB). This model synergistically harnesses the capabilities of both CNN and ViT architectures to elevate feature extraction across local and global domains. Leveraging the Branch Sharing scheme within MTB, EMTNet successfully achieves parameter reduction, thereby optimizing its aptitude for real-time depth estimation tasks. To produce finer depth maps, we synthesize high-resolution depth maps by fusing multi-scale features in the decoder section. Our model achieves good results on two benchmark datasets. When comparing the output prediction maps, our model demonstrates superior ability in generating high-quality depth maps, especially in complex scenes. Moreover, in regions with missing depth, our model excels at depth completion compared to other models. In terms of real-time performance, our approach achieves 32 frames per second, striking a harmonious equilibrium between accuracy and speed.