Introduction

Depth estimation, as a fundamental task in the field of computer vision, has been widely used in areas such as autonomous driving [1], robot navigation [2, 3] and virtual reality [4]. Depth estimation methods can primarily be divided into two types: active ranging and passive ranging. Active ranging relies on distance-measuring sensors to acquire depth information. These sensors mainly include costly LiDAR (Light Detection and Ranging) and Time-of-Flight (ToF) cameras. Passive ranging techniques estimate distances by calculating disparity. Against this backdrop, the robust capability of Convolutional Neural Networks (CNNs) for image feature extraction has greatly facilitated the advancement of monocular depth estimation technologies based on deep learning [5]. In supervised monocular depth estimation, the training process requires the use of accurate ground truth depth data. However, in complex environments, the unpredictability of these data makes their collection particularly challenging. Self-supervised monocular depth estimation methods employ synchronized stereo image pairs or monocular videos for training. Although training with monocular videos necessitates an additional pose estimation network to calculate the camera’s motion, it requires only a single camera for data collection. Therefore, the use of monocular videos in self-supervised monocular depth estimation remains widely adopted.

Fig. 1

Comparison of MonoViT [7], Lite-Mono [9] and our proposed Repmono. A MonoViT [7] combines the ViT model with self-supervised depth estimation. B Lite-Mono [9] introduces continuous dilated convolution modules and local–global feature interaction modules. C Repmono proposes a large convolutional kernel feature extraction module and a reparameterized token mixer module

The Vision Transformer, with its ability to model a global receptive field, continues to make breakthroughs in the visual domain [6], and Transformers are also being explored for self-supervised depth estimation. For example, MonoViT [7] utilizes advanced Transformer blocks in its encoder to achieve accurate prediction of fine-grained depth features, overcoming the limited receptive field of CNNs. However, the multi-head self-attention modules inside the Transformer make fast inference difficult due to their complex parallel operations. MT-SFMLearner [8] demonstrates that Transformer-based architectures are more robust than CNN-based ones, but also bring more parameters and lower inference speed. Lite-Mono [9] introduces the Continuous Dilated Convolution (CDC) module and the Local–Global Feature Interaction (LGFI) module to lighten the hybrid CNN-Transformer architecture; however, it still retains multi-head self-attention as its core module for capturing global features. While several studies combine Transformers with CNNs in self-supervised monocular depth estimation architectures, fast inference is neglected in pursuit of rich details. We believe that using Transformers in the architecture of self-supervised monocular depth estimation affects computational cost and inference speed. Pursuing higher performance at the price of more computational complexity and slower inference is an outcome we wish to avoid, as it would limit the network's capability in practical applications [10, 11]. Exploring how to simultaneously achieve excellent performance and fast inference therefore becomes the main focus of our research (Fig. 1).

In order to solve the above problems, this study designs a lightweight and efficient self-supervised monocular depth estimation network with a pure CNN architecture. Drawing on the idea of MetaFormer [12], we design the Large Convolution Kernel Transformer (LCKT) module for multi-scale feature extraction, built around token mixing and channel mixing, and enhance local information extraction through the simple yet efficient SENet [13] attention mechanism. The RepTM module proposed in this study reparameterizes depthwise convolutions and, used in conjunction with the LCKT module, extracts both global information and local details, which effectively improves inference speed without sacrificing model performance. Our contributions in this paper can be summarized in the following three aspects:

1. We propose a self-supervised monocular depth estimation network called Repmono, which is capable of high-speed inference while maintaining high performance.

2. We carefully design the LCKT module with a large convolutional kernel and the RepTM module based on the structural reparameterization technique. Experiments show that the LCKT module achieves effective local and global feature extraction, and the RepTM module further improves the efficiency of the network while extracting detailed features.

3. Evaluations on the publicly available KITTI dataset [14] show that our lightweight model has fewer parameters and higher accuracy. We also explore the generalization capability of our network architecture on the Make3D dataset [15] and the DrivingStereo dataset [16], where our model exhibits better generalization performance than other lightweight models.

Related work

Monocular depth estimation

Because 3D scenes at different scales can project to the same 2D image, depth estimation from a single image has always been a challenging problem. Supervised depth estimation uses real depth maps as supervision signals, enabling accurate estimation of depth from a single RGB image. Eigen et al. [17] applied deep networks to depth estimation for the first time, using a multi-scale network structure to extract global and local depth features from the input image. Subsequent studies continuously improved these deep networks [18,19,20] to achieve better performance, but they require ground-truth depth, which is difficult to collect in the real world. Garg et al. [21] learn depth implicitly through reprojection of stereo images, using a pixel-based reconstruction loss and thereby framing depth estimation as a novel view synthesis problem. Godard et al. [22] propose Monodepth on this basis, incorporating a left-right consistency loss to ensure consistent depth predictions between the left and right images. However, the stereo pairs required to train such methods are also difficult to obtain. These studies dispense with ground-truth depth by exploiting image reconstruction losses, but they inevitably rely on stereo images for self-supervised training; self-supervised depth estimation that uses no labelled ground-truth depth at all during training has therefore gradually become a viable alternative. SFMLearner, proposed by Zhou et al. [23], learns both depth and ego-motion from consecutive frames of monocular video, but it cannot remove the loss introduced by dynamic objects across consecutive frames. To handle dynamic objects, Vijayanarasimhan et al. [24] learn multiple object masks in a deep network, Guizilini et al. [25] combine semantic information with depth estimation to reduce the photometric loss caused by dynamic objects, and the GeoNet model proposed by Yin et al. [26] introduces optical flow estimation to predict dynamic objects in an image sequence. These studies add auxiliary tasks to the network to minimise the loss caused by dynamic objects, but this also increases the number of model parameters. Monodepth2, proposed by Godard et al. [27], uses an automatic masking loss to remove dynamic objects moving at the same speed as the camera and a minimum reprojection loss to handle occlusions between adjacent frames, without requiring additional learning tasks. Therefore, the model proposed in this study follows the self-supervised training strategy of Monodepth2 [27].

Network architecture for depth estimation

The network architecture in monocular depth estimation has a significant impact on the final depth prediction results. Prior to the application of Transformers to vision, most deep learning work on monocular depth estimation focused on the design of CNN architectures such as ResNet [28], VGGNet [23], HRNet [29] and PackNet [30]. These classical networks achieve remarkable results on self-supervised monocular depth estimation tasks, but they pay little attention to parameter count and inference speed, and their finite receptive fields during convolution make it difficult to retain specific detailed features. Several lightweight designs target efficiency: R-MSFM [34] learns multi-scale features through a feature modulation module and uses only the first three stages of ResNet18 [28] as its backbone to reduce the number of parameters, offering high efficiency but lacking in deep feature extraction; PydNet [35] is an unsupervised network capable of performing depth estimation on a CPU, but despite its low parameter count it cannot extract rich hierarchical features; the FastDepth [36] model achieves fast inference through pruning and optimization but has limitations in handling the details of depth estimation. With the arrival of Transformers in the visual domain, several studies incorporate them into the network architecture to enhance performance: [31] improves feature fusion by incorporating an attention module, [32] uses self-attention to enhance semantic features in a VGG [33] encoder, and MT-SFMLearner [8] points out that while Transformer-based depth estimation architectures are more robust than CNN-based ones, they hamper operational efficiency. MonoViT [7] combines convolution with Transformer blocks to retain more detailed features, but at the cost of a larger parameter count, and the multi-head self-attention core module used in Lite-Mono [9] hinders fast inference. Exploring how to achieve both excellent performance and fast inference has therefore become the main focus of our research.

Structural reparameterisation techniques

The structural reparameterisation technique was initially proposed and applied to VGG [33] architectures by Ding et al. [29]; its core advantage lies in fully exploiting network capacity during training while accelerating inference. To enhance performance, the network adopts a multi-branch structure during training, consisting of a \(3 \times 3\) convolution branch, a \(1 \times 1\) convolution branch and an identity-mapping branch. During inference, these branches are fused into a single branch: the identity mapping is treated as a \(1 \times 1\) convolution, the \(1 \times 1\) convolution is transformed into a \(3 \times 3\) convolution by zero-padding its kernel, and, based on the linear additivity of convolution, the resulting \(3 \times 3\) convolution is obtained by summing the three kernels and the three bias vectors. However, given the growth in model parameters, RepVGG [37] is difficult to apply to different types of network architectures. DBB [38] explores six reparameterisation methods on this foundation that improve model performance but do not speed up inference, and [29] combines depthwise convolutions with pointwise convolutions to improve inference speed and reduce the overall number of parameters, at the cost of a decrease in overall network performance. Moreover, when we tried inserting RepConv [37] at different positions of our architecture, it did not yield optimal performance, indicating that the existing reparameterisation modules are not directly suitable for architectures in the field of depth estimation. Therefore, we design a new reparameterisation module that integrates well with existing state-of-the-art depth estimation architectures, providing a solution for achieving both high network performance and fast inference.
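To make the fusion concrete, the following minimal PyTorch sketch merges a \(3 \times 3\) branch, a \(1 \times 1\) branch and an identity branch into a single \(3 \times 3\) convolution. It assumes plain (non-grouped) convolutions whose BatchNorm layers have already been folded into per-branch weights and biases; it is an illustration of the general technique, not the exact implementation of RepVGG [37] or of our RepTM module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_branches(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d, use_identity: bool) -> nn.Conv2d:
    """Fuse parallel 3x3, 1x1 and identity branches into one 3x3 convolution."""
    out_c, in_c = conv3x3.weight.shape[:2]

    # 1x1 kernel -> equivalent 3x3 kernel by zero-padding around the centre.
    k = conv3x3.weight + F.pad(conv1x1.weight, [1, 1, 1, 1])
    b = conv3x3.bias + conv1x1.bias

    if use_identity:
        # Identity mapping == 3x3 kernel with a single 1 at the centre of its own channel
        # (only valid when input and output channels match).
        assert in_c == out_c
        k_id = torch.zeros_like(conv3x3.weight)
        for c in range(out_c):
            k_id[c, c, 1, 1] = 1.0
        k = k + k_id

    fused = nn.Conv2d(in_c, out_c, kernel_size=3, padding=1, bias=True)
    fused.weight.data.copy_(k)
    fused.bias.data.copy_(b)
    return fused

# Example: the fused layer produces the same output as summing the branches.
x = torch.rand(1, 8, 16, 16)
c3 = nn.Conv2d(8, 8, 3, padding=1)
c1 = nn.Conv2d(8, 8, 1)
fused = fuse_branches(c3, c1, use_identity=True)
assert torch.allclose(fused(x), c3(x) + c1(x) + x, atol=1e-5)
```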

Fig. 2

Overview of our Repmono framework. Repmono is divided into two parts, DepthNet and PoseNet. The depth encoder uses a large convolutional kernel feature extraction module and a reparameterized token mixer module to extract rich depth features while speeding up inference. PoseNet follows previous works [23, 31, 39] to estimate the pose between neighboring frames of a monocular video

Proposed framework

Motivation for the design

The encoder in monocular depth estimation network architectures is continuously being improved. Current architectures focus on the details of image depth estimation while neglecting model size and inference speed. Due to the limited receptive field of CNN architectures, it is difficult to effectively extract global information from an image. The Transformer architecture is known for its powerful extraction of contextual information and achieves high accuracy, but its parallel attention structure increases model size and limits speed. MetaFormer [12] demonstrates the importance of the overall Transformer architecture and points out that the self-attention module is not all that is needed, verifying this across various experiments. Inspired by this, we design a framework specifically for monocular depth estimation that aims to achieve both high accuracy and high inference speed. We adopt the proposed LCKT and RepTM modules in an encoder-decoder architecture. In LCKT, we employ a large \(7 \times 7\) convolutional kernel to expand the receptive field. Our proposed RepTM achieves spatial information fusion in the token mixer through depthwise convolution, while the channel mixer achieves inter-channel information interaction through a \(1 \times 1\) dilated convolution and a \(1 \times 1\) projection layer. Compared with multi-head attention, LCKT combined with RepTM can capture both local and global features while accelerating inference. The architectural details are described below.

Deep network architecture

Depth encoder. The encoder-decoder architecture of DepthNet is able to extract features efficiently, as demonstrated in previous work [23, 27]. As shown in Fig. 2, the proposed encoder is divided into four stages; except for the first stage, all subsequent stages use the same modules. The input image first enters Stage 1, where local features are extracted through a \(3 \times 3\) convolution with a stride of 2, generating feature maps of size \(\frac{H}{2}\times \frac{W}{2}\times C_{1}\). Stage 2 consists of a downsampling layer, the LCKT module, and the RepTM module. Its input is the concatenation of the feature maps from the previous stage and the feature maps obtained by average pooling of the original image; this structure compensates for the spatial information lost through downsampling and produces feature maps of size \(\frac{H}{4}\times \frac{W}{4}\times C_{2}\). The third and fourth stages use the same structure, and their inputs likewise include feature maps obtained by average pooling of the original image; their downsampling layers generate feature maps of size \(\frac{H}{8}\times \frac{W}{8}\times C_{3}\) and \(\frac{H}{16}\times \frac{W}{16}\times C_{4}\), respectively.
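The stage layout can be summarised by the following PyTorch-style skeleton. The channel widths \(C_1\)–\(C_4\) and the internals of each stage (the LCKT and RepTM blocks) are placeholders, so this is only a sketch of the resolution and concatenation scheme described above, not the actual network definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderSkeleton(nn.Module):
    """Four-stage layout producing H/2, H/4, H/8 and H/16 feature maps.
    Channel widths and the LCKT/RepTM block internals are placeholders."""
    def __init__(self, c=(32, 48, 80, 128)):
        super().__init__()
        self.stem = nn.Conv2d(3, c[0], 3, stride=2, padding=1)          # Stage 1: H/2 x W/2 x C1
        # Stages 2-4 downsample the concatenation of the previous features
        # and an average-pooled copy of the input image.
        self.down = nn.ModuleList([
            nn.Conv2d(c[0] + 3, c[1], 3, stride=2, padding=1),          # Stage 2: H/4 x W/4 x C2
            nn.Conv2d(c[1] + 3, c[2], 3, stride=2, padding=1),          # Stage 3: H/8 x W/8 x C3
            nn.Conv2d(c[2] + 3, c[3], 3, stride=2, padding=1),          # Stage 4: H/16 x W/16 x C4
        ])

    def forward(self, img):
        feats, f = [], self.stem(img)
        for i, down in enumerate(self.down):
            pooled = F.avg_pool2d(img, kernel_size=2 ** (i + 1))        # image at the current resolution
            f = down(torch.cat([f, pooled], dim=1))
            # ...LCKT and RepTM blocks would operate on f here...
            feats.append(f)
        return feats                                                     # multi-scale features for the decoder
```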

Fig. 3

LCKT design concept. The token mixer of LCKT is SENet, and the channel MLP module consists of a \(7 \times 7\) large-kernel depthwise convolution, batch normalization (BatchNorm), \(1 \times 1\) pointwise convolution and the GELU activation function

Large convolution kernel transformer (LCKT). As shown in Fig. 3, the Transformer mainly consists of two parts: a token mixer module based on the self-attention mechanism and a channel MLP module. Previous research [12] found that performance competitive with the original Transformer could still be maintained across multiple computer vision tasks when the self-attention in the token mixer was replaced with a simpler spatial pooling operation. They therefore regarded self-attention as just one particular token mixer and referred to the general architecture as MetaFormer [12]. To accelerate inference, we replace the token mixer with the simple and efficient SENet [13] module and use batch normalisation across the network to improve stability. In the channel MLP module, we introduce depthwise dilated convolution, inspired by the design of the CDC module [9]; by using dilated convolutions with different dilation rates in different stages, multi-scale feature fusion is achieved. A dilated convolution can be defined by the formula:

$$\begin{aligned} y[a]=\sum _{w=1}^{W} x[a+r \cdot w]\, h[w] \end{aligned}$$
(1)

where x[a] is the input, h[w] is a filter of length W, and r is the dilation rate; when \(r=1\), this reduces to a standard convolution. [31] shows that using large depthwise convolutions in the network is competitive with self-attention variants, at the cost of a modest increase in inference time. We therefore enlarge the initial convolution kernel from \(3 \times 3\) to \(7 \times 7\). Although dilated convolutions increase the receptive field through the dilation rate, their computation is based on sparse sampling: after dilation the kernel loses continuity, and the number of sparse samples for a dilated large kernel is significantly greater than for a small kernel, which may affect the continuity of information. To fuse features efficiently in the depth direction while preserving information continuity, we use pointwise convolution to operate on each pixel of the feature map and then perform a weighted combination in the depth direction. This design works in tandem with the large kernel to optimize model performance. Experimental results show that the \(7 \times 7\) large convolution kernel significantly improves model performance: compared to a \(5 \times 5\) convolution, the average error decreases by 6.4%, and compared to a \(3 \times 3\) convolution, it decreases by 9.4%. This finding highlights the potential of large convolution kernels for improving model performance and robustness.
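A minimal sketch of the resulting block is given below, assuming one possible arrangement of the components named above (SE token mixer, \(7 \times 7\) depthwise dilated convolution, BatchNorm, pointwise convolutions and GELU); the expansion ratio, residual placement and reduction ratio of the SE gate are illustrative assumptions rather than the exact configuration of LCKT.

```python
import torch
import torch.nn as nn

class LCKTBlockSketch(nn.Module):
    """SE token mixer followed by a channel MLP built around a 7x7
    depthwise dilated convolution, BatchNorm, pointwise convs and GELU."""
    def __init__(self, dim, dilation=1, se_ratio=4, expand=4):
        super().__init__()
        # Token mixer: a squeeze-and-excitation gate (channel re-weighting).
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // se_ratio, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // se_ratio, dim, 1), nn.Sigmoid(),
        )
        # Channel MLP: 7x7 depthwise dilated conv + BN + pointwise convs + GELU.
        pad = 3 * dilation                       # keeps the spatial size for a 7x7 kernel
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim, 7, padding=pad, dilation=dilation, groups=dim),
            nn.BatchNorm2d(dim),
            nn.Conv2d(dim, dim * expand, 1), nn.GELU(),
            nn.Conv2d(dim * expand, dim, 1),
        )

    def forward(self, x):
        x = x + self.se(x) * x                   # token mixing via channel re-weighting
        x = x + self.mlp(x)                      # channel MLP with large-kernel dilated conv
        return x
```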

Fig. 4

Equivalent diagram of the RepTM module structure. The multi-branch structure used during training is converted to a single-branch structure during inference

Reparameterize token mixer (RepTM). As shown in Fig. 4, we replace the token mixer with elements from the lightweight network MobileNet, specifically a \(3 \times 3\) depthwise convolution and a batch normalisation (BatchNorm) layer. The depthwise convolution enables efficient fusion of spatial information within the model. To enhance flexibility and efficiency, a residual branch and a branch combining a \(1 \times 1\) depthwise convolution with a BatchNorm layer are incorporated into the token mixer in parallel. According to the additivity of convolution:

$$\begin{aligned} \text {Conv}(x,\omega _1)+\text {Conv}(x,\omega _2)+\text {Conv}(x,\omega _3) = \text {Conv}\left( x,\omega _1+\omega _2+\omega _3\right) \end{aligned}$$
(2)

where \(\text {Conv}(x,\omega )\) denotes the convolution operation with input feature x and convolution kernel \(\omega \), and \(\omega _1\), \(\omega _2\) and \(\omega _3\) are convolution kernels of the same size. This structural reparameterization design endows the network with enhanced feature extraction capability during the training phase, while the multi-branch structure can be converted into a single \(3 \times 3\) depthwise convolution during the inference phase. This conversion effectively reduces the computational and storage costs introduced by the parallel structure of the network, thereby improving the computational efficiency of the model during inference. The channel MLP module primarily uses two \(1 \times 1\) convolutions coupled with a GELU activation function, facilitating inter-channel information exchange and further boosting the model's performance.
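The training-time structure can be sketched as follows; the branch composition matches the description above (a \(3 \times 3\) depthwise convolution with BatchNorm, a \(1 \times 1\) depthwise convolution with BatchNorm, and a residual branch), while the MLP expansion ratio is an illustrative assumption. At inference time the three branches can be collapsed into a single \(3 \times 3\) depthwise convolution using the same additivity argument as in the fusion sketch of the previous section, applied per channel group.

```python
import torch
import torch.nn as nn

class RepTMTokenMixerSketch(nn.Module):
    """Training-time token mixer: three parallel branches whose outputs are
    summed; at deployment they fold into one 3x3 depthwise convolution."""
    def __init__(self, dim):
        super().__init__()
        self.dw3 = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
                                 nn.BatchNorm2d(dim))
        self.dw1 = nn.Sequential(nn.Conv2d(dim, dim, 1, groups=dim),
                                 nn.BatchNorm2d(dim))

    def forward(self, x):
        return self.dw3(x) + self.dw1(x) + x     # third branch = residual identity


class ChannelMLPSketch(nn.Module):
    """Channel MLP: two 1x1 convolutions with GELU in between."""
    def __init__(self, dim, expand=4):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(dim, dim * expand, 1), nn.GELU(),
                                 nn.Conv2d(dim * expand, dim, 1))

    def forward(self, x):
        return self.net(x)
```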

Depth decoder. We use bilinear upsampling to increase the spatial resolution of the feature maps, consistent with the approach used in Lite-Mono [9], and integrate the features of the three encoder stages through convolutional layers. After each upsampling block, we attach a prediction head that outputs inverse depth maps at full, \(\frac{1}{2}\) and \(\frac{1}{4}\) resolution, providing depth predictions at different levels of accuracy.
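One decoder step can be sketched as below; the channel widths, the exact fusion convolution and the sigmoid prediction head are illustrative assumptions rather than the precise decoder definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStageSketch(nn.Module):
    """One decoder step: bilinearly upsample, fuse with the encoder skip
    feature, and predict an inverse-depth map through a small head."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1)
        self.head = nn.Sequential(nn.Conv2d(out_ch, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.relu(self.fuse(torch.cat([x, skip], dim=1)))
        return x, self.head(x)   # features for the next stage + inverse depth at this scale
```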

PoseNet. This study follows the framework established in prior works [23, 31, 39], utilizing a pretrained lightweight ResNet18 [28] to construct PoseNet. The network takes a concatenated colour image pair [\(I, I'\)] as input and estimates the 6-degree-of-freedom relative pose between adjacent frames with a three-layer convolutional pose decoder.

Self-supervised learning

Our task is to infer a depth map from a single RGB image in the absence of ground-truth depth. In this process, the depth estimation network generates a depth map \(D_t\) from a given input image \(I_t\). Simultaneously, the pose estimation network processes temporally adjacent images, computing the relative pose \(T_{t\longrightarrow t'}\) from the target image \(I_t\) to the source image \(I_{t'}\) (where \(t'\) is the previous or subsequent frame of t). The depth map \(D_t\) and the pose \(T_{t\longrightarrow t'}\) are used to synthesize the target view, which provides the supervision signal for training.

Photometric consistency loss. Following previous research [23], we formulate the learning objective as minimizing the image reconstruction loss Lp between the target image \(I_t\) and the image synthesized from the source image \(I_{t'}\). Lp is defined as:

$$\begin{aligned} Lp=\min _{t'}^{} F(I_t,I_{t'\longrightarrow t}) \end{aligned}$$
(3)

where \(F(\cdot )\) in Eq. (3) denotes the photometric reconstruction error and \(I_{t'\longrightarrow t}\) denotes the result of warping image \(I_{t'}\) into the view of image \(I_t\), obtained as

$$\begin{aligned} I_{t'\longrightarrow t} = I_{t'}\left\langle \text {proj}\left( D_t, T_{t\longrightarrow t'}, K\right) \right\rangle \end{aligned}$$
(4)

where proj() gives the two-dimensional coordinates obtained by projecting the depth map \(D_t\) into the view of image \(I_{t'}\), \(\left\langle \cdot \right\rangle \) is the sampling operator, and K is the camera intrinsic matrix, which is identical for all images; we sample the images using differentiable bilinear sampling [40].

$$\begin{aligned} F(I_a, I_b) = \frac{\alpha }{2} \left( 1-\text {SSIM}(I_a, I_b)\right) + (1-\alpha ) \left\Vert I_a - I_b \right\Vert \end{aligned}$$
(5)

where \(I_a\) is the original target image and \(I_b\) is the reconstructed image obtained from Eq. (4); \(F(I_a, I_b)\) is the weighted sum of the Structural Similarity Index Measure (SSIM) [41] term and the intensity difference term, with \(\alpha \) set to 0.85 empirically [27].
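A per-pixel implementation of Eq. (5) might look like the following sketch, using the simplified \(3 \times 3\) average-pooled SSIM common in this line of work [27]; padding details and the exact SSIM formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM using 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def photometric_error(i_a, i_b, alpha=0.85):
    """F(I_a, I_b) from Eq. (5): weighted SSIM + intensity difference, per pixel."""
    ssim_term = (1 - ssim(i_a, i_b)).mean(1, keepdim=True)
    l1_term = (i_a - i_b).abs().mean(1, keepdim=True)
    return alpha / 2 * ssim_term + (1 - alpha) * l1_term
```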

Minimum photometric loss. To handle occlusions in the source images, the minimum photometric loss [27] takes, for each pixel, the minimum of the losses computed against the forward and backward adjacent frames.

$$\begin{aligned} L_{\text {SS}}(I_s, I_t) = \min _{i \in \{-1,+1\}} F\left( I_t, I_{t+i \longrightarrow t}\right) \end{aligned}$$
(6)

where \(I_s = I_{t+i}\) denotes the previous (\(i=-1\)) or next (\(i=+1\)) frame of the target image, so the minimum is taken over the forward and backward neighboring frames.

Automatic mask. For dynamic objects in the image, the automatic mask [27] filters out pixels that appear stationary relative to the camera (for example, objects moving at the same speed as the camera). We denote the mask by u:

$$\begin{aligned} u = \left[ \min _{i \in \{-1,+1\}} F\left( I_t, I_{t+i}\right) > L_{\text {SS}}(I_s, I_t) \right] \end{aligned}$$
(7)
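Here \([\cdot ]\) denotes the Iverson bracket, so u is 1 where the warped reprojection loss is lower than the loss against the unwarped source images and 0 otherwise. One possible reading of Eqs. (6) and (7) in code is sketched below; the tensor shapes and the reduction to a scalar are assumptions for illustration.

```python
import torch

def min_reprojection_with_automask(reproj_losses, identity_losses):
    """Per-pixel minimum over the forward/backward source frames (Eq. 6),
    masked by the auto-mask of Eq. (7)."""
    # reproj_losses, identity_losses: [B, 2, H, W] (one channel per source frame)
    min_reproj, _ = reproj_losses.min(dim=1, keepdim=True)
    min_identity, _ = identity_losses.min(dim=1, keepdim=True)
    mask = (min_identity > min_reproj).float()              # u in Eq. (7)
    return (mask * min_reproj).sum() / mask.sum().clamp(min=1.0)
```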

Smoothness loss. In addition, to prevent the estimated inverse depth from shrinking arbitrarily, an edge-aware smoothness loss [27, 29] is utilized.

$$\begin{aligned} L_{\text {smooth}} = |\partial _x d_t^*|\cdot e^{-|\partial _x I|} + |\partial _y d_t^*|\cdot e^{-|\partial _y I|} \end{aligned}$$
(8)

where \(\partial _x\) is the gradient operator in the x direction, \(\partial _y\) is the gradient operator in the y direction, and \(d_t^* = \frac{d_t}{\hat{d}_t}\) denotes the mean normalised inverse depth.
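A direct implementation of Eq. (8) on a mean-normalised inverse depth map is sketched below; the small epsilon and the mean reduction are assumptions.

```python
import torch

def smoothness_loss(inv_depth, image):
    """Edge-aware smoothness of Eq. (8) on mean-normalised inverse depth."""
    d = inv_depth / (inv_depth.mean(dim=[2, 3], keepdim=True) + 1e-7)
    dx_d = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
    dy_d = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```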

Final total loss. The final total loss L is composed of the total image reconstruction loss \(uL_{\text {SS}}(I_{s}, I_{t})\) and the smoothness loss \(\lambda \cdot L_{\text {smooth}}\).

$$\begin{aligned} L = \frac{1}{3} \sum _{i=1}^{3} (uL_{\text {SS}}(I_{s}, I_{t}) + \lambda \cdot L_{\text {smooth}}) \end{aligned}$$
(9)

where u is the automatic mask, \(L_{\text {SS}}(I_s, I_t)\) is the minimum photometric loss, and \(\lambda \) weights the smoothness term; we set \(\lambda \) to 0.001. The loss is computed at the three output scales, whose inverse depth maps are upsampled and fused to full resolution.
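Combining the pieces above, Eq. (9) simply averages the masked photometric term and the weighted smoothness term over the three scales; a minimal sketch, assuming each per-scale term has already been reduced to a scalar:

```python
def total_loss(masked_photo_losses, smooth_losses, lam=1e-3):
    """Eq. (9): average (u * L_SS + lambda * L_smooth) over the three output scales."""
    terms = [p + lam * s for p, s in zip(masked_photo_losses, smooth_losses)]
    return sum(terms) / len(terms)
```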

Experiments

Datasets

KITTI. The KITTI dataset [14] is a stereo vision dataset containing 61 scenes, mainly used for stereo imaging studies. Its images, with dimensions of \(1242 \times 375\), were captured by a stereo camera system mounted on a LiDAR-equipped vehicle. Following previous studies in the field [23, 27, 29], we use the data split defined by Eigen et al. [17], which consists of 39,810 monocular triplets for training and 4,424 images for validation. We evaluate single-view depth performance on the test split of [19], using both the original LiDAR ground truth (697 images) and the improved ground-truth labels [42] (652 images).

Make3D. The Make3D dataset [15] mainly contains images of outdoor environments and is often used to test the generalisation ability of monocular depth estimation frameworks. We test the Repmono model using the same image preprocessing steps and evaluation criteria as in [27].

DrivingStereo. The DrivingStereo dataset [16] is a large-scale stereo dataset containing real-world driving scenes. The dataset covers a variety of scenes; its official website provides images classified by weather condition, with 500 frames per class. We use four scene types from DrivingStereo, sunny, rainy, cloudy and foggy, to evaluate the generalization capability of our model in specific driving scenarios.

Table 1 Comparison of Repmono with some recent representative methods on the KITTI benchmark using the Eigen split

Implementation details

In our experiments, the model is implemented in the PyTorch framework and trained on a server equipped with an NVIDIA A100. Both the depth and pose networks are pretrained on ImageNet [27, 34]. The model employs AdamW [34] as the optimizer, with the training period set to 30 epochs, a batch size of 20 and an initial learning rate of 0.0001. For images with resolutions of \(640 \times 192\) and \(1024 \times 320\), training is conducted on a single GPU with 40 GB of memory; training takes approximately 8 h for \(640 \times 192\) images and about 12 h 30 min for \(1024 \times 320\) images. We use the same data augmentation strategy as previous studies [27]. For the final evaluation of the model's effectiveness, we use seven metrics widely used in the field of depth estimation: Absolute Relative Difference (Abs Rel), Squared Relative Difference (Sq Rel), Root Mean Square Error (RMSE), Root Mean Square Logarithmic Error (RMSE Log), as well as three accuracy metrics (\(\delta _1< 1.25, \quad \delta _2< 1.25^2, \quad \delta _3 < 1.25^3\)).
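For reference, the reported optimiser settings translate into the following PyTorch setup; the network is replaced by a stand-in module and the data pipeline is omitted, so this is only a sketch of the hyper-parameters, not the actual training script.

```python
import torch
import torch.nn as nn

# Stand-in for the DepthNet/PoseNet pair; only the hyper-parameters below
# come from the text (AdamW, lr 1e-4, 30 epochs, batch size 20, 640x192 input).
model = nn.Conv2d(3, 1, 3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(30):
    images = torch.rand(20, 3, 192, 640)      # dummy batch in place of the KITTI loader
    loss = model(images).abs().mean()         # dummy objective in place of Eq. (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```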

KITTI results

We compare our model with existing classic and lightweight models. As shown in Table 1, we report results for training with monocular video (M) and with monocular video without pretraining (M*). For monocular video (M) we use two input sizes, low resolution (\(640 \times 192\)) and high resolution (\(1024 \times 320\)), while the M* models are tested only at \(640 \times 192\) resolution. At the low resolution, the overall size of our model is only 2.31M parameters, which is 85% smaller than the Monodepth2 [27] model, 40% smaller than R-MSFM3, and 16% smaller than Lite-Mono-small. At the same time, our model surpasses most models in the table in depth accuracy, matches the Lite-Mono-small [9] model overall, and exceeds it in \(\delta _1\) accuracy. Figure 5 provides a more intuitive demonstration of the effectiveness of our proposed solution in depth estimation. We select the lightweight network R-MSFM [46], the classic Monodepth2 with a ResNet18 [28] encoder, and the lightweight Lite-Mono-small [9], which uses an improved self-attention mechanism in its encoder, for comparison. Our solution performs better on thin structures and on overlapping structures in low light. For example, for the highway signage in the first column, the depth maps from other models show dragging artefacts and incomplete structure on such thin structures, whereas our model's depth map performs better. In the third column, the person is poorly distinguished from the background; our model successfully separates the person's shape from the background, which has always been a challenging problem for monocular depth estimation. Additionally, we achieve good results on reflective glass structures: in the second column, unlike other models, our model exhibits neither depth blur nor structure loss. In the fourth column, our model produces more coherent car contours than the other methods. This is mainly attributed to our use of the LCKT module with a large convolutional kernel and the RepTM module with a reparameterized structure. The large-kernel dilated convolution of the LCKT module obtains rich multi-scale information, and the RepTM module extracts more detailed features, ultimately enabling our solution to estimate depth more accurately in fine details.

Fig. 5

Qualitative results on the KITTI [14] Eigen split. Comparing our model with depth maps generated by Monodepth2 [27], R-MSFM3 [46], R-MSFM6 [46], and Lite-Mono-small [9], it can be seen that Repmono better predicts thin objects, reflective surfaces, and objects that are hard to distinguish from the background

Fig. 6

Qualitative results on the Make3D dataset [15]. Comparing Repmono with Monodepth2 [27] and R-MSFM3 [46], Repmono generalises best

Make3D results

We test the generalisability of our model on the Make3D dataset [15]. The model is trained using \(640 \times 192\) resolution images from the KITTI dataset [14] and then evaluated on Make3D [15]. For fairness, we do not perform any fine-tuning on Make3D [15], and evaluation is carried out using the criteria proposed in [27]. Table 2 presents the comparison between Repmono and other lightweight models: Repmono has the smallest number of parameters among all lightweight models and performs best on all four metrics. Figure 6 shows the superiority of our scheme and that our model can represent objects at different scales more accurately; it also specifically illustrates the advantages of Repmono over other lightweight networks. We select objects at varying distances in three categories of images for comparison: the grass in the third column represents relatively close objects, the large tree in the first column represents objects at a medium distance, and the small tree in the second column represents relatively distant objects. For objects at all these distances, the depth maps produced by Repmono exhibit the best object integrity (Table 2).

Table 2 Comparison of the proposed Repmono with other lightweight models on the Make3D dataset [15]
Table 3 Comparison of Repmono with other lightweight models on the DrivingStereo Dataset [16]

DrivingStereo results

To evaluate the generalization capability of Repmono in specific road scenarios, we test it under four weather conditions from the DrivingStereo dataset [16]. We again choose lightweight models for comparison; all models are trained using \(1024 \times 320\) resolution images from the KITTI dataset [14] and then tested in sunny, cloudy, foggy, and rainy road scenarios. For fairness, none of the models is fine-tuned. Table 3 shows the evaluation results of Repmono and the other lightweight models under the four weather conditions: despite having the smallest number of parameters, Repmono achieves the best metrics in sunny, cloudy, and foggy conditions. Figure 7 shows the depth maps of Repmono and the other lightweight models in the four weather scenarios. For close-range objects, Repmono performs significantly better in fog than in rain; it produces the best depth maps for medium-range objects in cloudy scenes and for long-range objects in sunny scenes.

Fig. 7

Qualitative results on the DrivingStereo Dataset [16]. Compared to other lightweight models, Repmono shows superior performance under cloudy, foggy, and sunny conditions

Fig. 8

Comparison of the inference speed of different models on the NVIDIA 1650 device. It can be seen that our model's inference is significantly faster than the other models, and is fastest when the batch size is set to 1 or 2

Fig. 9

Comparison of the inference speed of different models on the NVIDIA A100 device. It can be seen that the inference speed of our model differs little from that of Monodepth2 [27] and R-MSFM3 [46], is significantly faster than that of MonoViT-small and Monodepth2-50, and is fastest when the batch size is set to 8

Inference speed

Testing the inference speed of lightweight architectures is crucial. We conduct inference time tests of our model and other lightweight models on an NVIDIA A100 and an NVIDIA 1650. To ensure fairness, we standardize the input dimensions to (3, 640, 192) and omit data preprocessing before the data enter the model. The inference time is defined as the duration from when the data enter the encoder to when they exit the decoder. The model undergoes a "warm-up" phase before GPU inference to bring it into a working state, avoiding errors that could arise if the GPU shifts into power-saving mode during testing. To prevent the asynchronous execution of the GPU from distorting the measurements, we start the timer only when the GPU begins the computation and synchronise the CPU and the GPU using the synchronize function of the PyTorch library, which ensures that the computation on the CPU and GPU is properly coordinated. According to the test results in Figs. 8 and 9, our model demonstrates excellent inference performance on both GPU devices. On the NVIDIA 1650, with batch sizes of 1 and 2, our model exhibits the fastest inference speed: 81% faster than MonoViT-small [7], 60% faster than Monodepth2-50 [27], and 27% faster than Lite-Mono-small [9]. On the NVIDIA A100, our model achieves the fastest inference speed at a batch size of 8. At batch sizes of 16 and 32, our model's inference speed is comparable to that of the lightweight network R-MSFM3 [46], demonstrating that our model maintains fast inference when handling larger throughput (Fig. 9).
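The timing protocol described above corresponds roughly to the following sketch (warm-up iterations, an explicit torch.cuda.synchronize() before and after timing, and a fixed \(3 \times 640 \times 192\) input); the numbers of warm-up and timed iterations are assumptions.

```python
import time
import torch
import torch.nn as nn

def measure_latency(model: nn.Module, device="cuda", warmup=20, iters=100):
    """Average GPU inference time per forward pass, in milliseconds."""
    model = model.eval().to(device)
    x = torch.rand(1, 3, 192, 640, device=device)   # fixed input size (3, 640, 192)
    with torch.no_grad():
        for _ in range(warmup):                      # warm-up: leave power-saving state
            model(x)
        torch.cuda.synchronize()                     # start timing only once the GPU is idle
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()                     # wait for all queued kernels to finish
    return (time.time() - start) / iters * 1000.0
```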

Table 4 Model structure ablation experiments

Ablation experiments on model structures

To evaluate the effectiveness of the model, we execute a series of ablation experiments focused on validating the importance of the LCKT and RepTM modules in the architecture. These experiments are performed without ImageNet [34] pre-training and tested on the KITTI dataset [14] by controlling variables in the model. When only the LCKT module is used, there is an increase of approximately 2M parameters and about a 5 ms slowdown in inference compared to the baseline model, but key metrics such as Absolute Relative difference (Abs Rel), Squared Relative difference (Sq Rel), and Root Mean Squared Error (RMSE) improve. When the model integrates both the LCKT and RepTM modules, with \(7 \times 7\) convolutions in LCKT, performance improves the most: the parameter count is 2.31M, inference time is 8.89 ms, Abs Rel is reduced to 0.123, Sq Rel to 0.946, and RMSE to 5.017, while the accuracy metrics \(\delta _1< 1.25, \quad \delta _2< 1.25^2, \quad \delta _3 < 1.25^3\) are also the best. Compared with models using \(3 \times 3\) and \(5 \times 5\) convolutional kernels, although their inference is faster, they perform worse on the accuracy metrics, validating the effectiveness of \(7 \times 7\) convolutions in enhancing model performance. Overall, a robust balance is achieved among model size, speed, and accuracy (Table 4).

Conclusion

This paper introduces a lightweight self-supervised monocular depth estimation model named Repmono and its two supporting modules, LCKT and RepTM. This approach effectively combines large convolutions with reparameterized structures, enabling Repmono to achieve high performance and fast inference without relying on core Transformer modules. Compared to the classical model Monodepth2, Repmono reduces the number of parameters by 83.8% and speeds up inference by 60.1%. Repmono alleviates the high complexity and slow inference associated with the use of Transformer structures in depth estimation networks, although there is still room for improvement in accuracy. Our approach is extensively tested on the KITTI dataset to validate its effectiveness and demonstrates good generalization on the Make3D and DrivingStereo datasets. We hope the proposed method can contribute to the advancement of self-supervised monocular depth estimation and inspire subsequent research.