1 Introduction

Visual SLAM is vital for autonomous robots, with applications ranging from autonomous driving [1] to 3D reconstruction [2], navigation [3], and high-definition mapping [4]. Monocular SLAM, enabled by low-cost cameras, is crucial in this domain [5]. However, scale ambiguity often leads to significant scale drift during long-term operation, particularly in dynamic scenes [6].

Addressing scale drift in monocular SLAM usually involves providing depth information for image pixels [7]. Traditionally, this required additional sensors like stereo cameras or LiDAR [8], which increases costs and complexity. Recent advances in monocular depth estimation networks offer a promising alternative. While supervised methods require expensive LiDAR for training, self-supervised approaches, like Hr-depth [9] and Monodepth2 [10], learn from geometric projection matching without depth ground truth.

Nevertheless, most monocular depth estimation networks fail to produce consistent depth maps between frames, leading to tracking failures in SLAM [11]. In dynamic scenes, the depth estimation of dynamic objects is highly inaccurate and their edges appear blurred, primarily because the geometric consistency assumption is violated [12]. Moreover, monocular estimation methods lack accurate scale estimation and struggle to perceive the distance between objects and the camera, which further degrades depth estimation for dynamic objects, particularly in scenes with low discriminability, such as highways [7].

This paper proposes Dyna-MSDepth, a novel self-supervised monocular depth estimation network, aimed at addressing the aforementioned challenges. As shown in Fig. 1, Dyna-MSDepth produces a stable and reliable depth map with sharp edge segmentation and scale consistency in dynamic scenes, enabling direct utilization for monocular SLAM scale recovery. Dyna-MSDepth employs scale consistency loss to establish connections between depth values in consecutive frames. Additionally, a pre-trained supervised model provides depth priors, enabling depth value prediction on dynamic objects and segmentation of dynamic object edges. To perceive objects at varying distances, multi-scale input is employed in the monocular sequences, while high-order spatial interaction enhances feature fusion. Dyna-MSDepth is extensively evaluated on four challenging dynamic datasets, including KITTI, TUM, DDAD, and BONN. The results demonstrate that Dyna-MSDepth outperforms existing State-Of-The-Art (SOTA) approaches both qualitatively and quantitatively. Moreover, the dense depth maps estimated by Dyna-MSDepth are directly utilized for scale recovery in ORB-SLAM3, resulting in significantly reduced scale drift in the KITTI dataset.

Fig. 1

Dyna-MSDepth is evaluated against SOTA methods on 4 challenging dynamic datasets. Top row: the original images from the TUM, BONN, KITTI, and DDAD datasets, respectively. Second row: the monocular depth estimation results of other SOTA methods. The left two columns are the results of [13], and the right two columns are the results of [11]. Third row: the monocular depth estimation results of Dyna-MSDepth. Bottom row: the ground truth depth maps provided by the datasets

In summary, the contributions of this paper are:

  1. Dyna-MSDepth is proposed as a solution for multi-scale monocular depth estimation in dynamic scenes, enabling direct utilization of the estimated depth maps for scale recovery in monocular SLAM to mitigate scale drift;

  2. The performance of Dyna-MSDepth is assessed through qualitative and quantitative evaluations on four challenging dynamic datasets, i.e., KITTI, TUM, DDAD, and BONN;

  3. The effectiveness of Dyna-MSDepth’s depth map is demonstrated through its application in monocular SLAM, with evaluations conducted on the KITTI dataset.

2 Related work

This section presents an overview of the relevant research. First, the scale drift problem in monocular SLAM is addressed (Sect. 2.1). Second, the current state of research on monocular depth estimation networks is analyzed (Sect. 2.2). Next, visual SLAM systems combined with depth estimation are reviewed (Sect. 2.3). Finally, the significance of multi-scale approaches in self-supervised monocular depth estimation is discussed (Sect. 2.4).

2.1 Scale drift in monocular SLAM

Visual SLAM encompasses monocular [7], stereo [14], and RGBD [15] techniques, each employing distinct sensor setups. Monocular SLAM is preferred due to its cost-effectiveness and simplicity [5]. Nonetheless, it suffers from scale ambiguity, leading to significant scale drift over time. Consequently, this limitation severely impacts subsequent tasks such as localization, mapping, 3D reconstruction, and navigation [16]. To address scale drift, [8] and [17] incorporated LiDAR and IMU sensors. [18] assumed a known camera height from the ground. [19] and [7] employed costly global bundle adjustment after calculating the Sim(3) transformation. Several learning-based visual odometry methods aimed to directly recover the absolute scale in the scene [20,21,22]. However, their accuracy lagged behind traditional multi-view geometry-based SLAM algorithms due to the absence of bundle adjustment, feature search, and loop closing [23].

2.2 Self-supervised monocular depth estimation

Monocular depth estimation methods in deep learning are categorized into three approaches. The first category involves supervised training using ground truth obtained from LiDAR and RGB-D cameras. [24] leveraged high-order three-dimensional geometric constraints and ground-truth depth values to enhance depth prediction accuracy. [25] improved supervised training performance through the incorporation of attention mechanisms. These approaches necessitate costly depth ground truth. The second category utilizes calibrated binocular cameras to project depth maps from the left to the right camera and subsequently calculates a photometric loss. [26] addressed the uncertainties associated with stereo depth estimation. While this method can provide metric depth, it requires precise camera calibration. The third category explores unsupervised or weakly supervised methods to learn depth prediction. [10] was the earliest and most typical self-supervised monocular depth estimation method. [27] enhanced the performance of the self-supervised method through semantic guidance. [28,29,30] investigated the impact of physical-world attacks on self-supervised monocular depth estimation and proposed several effective countermeasures. Self-supervised methods, which do not rely on depth ground truth, are prevalent. To address the challenges of scale inconsistency, dynamic object estimation, and scale ambiguity in self-supervised monocular depth estimation, [11] proposed a scale consistency loss to ensure the continuity of inter-frame depth values. Additionally, [12] introduced a pseudo-depth approach to tackle the depth estimation problem of dynamic objects. These advancements contribute to the growing popularity of self-supervised monocular depth estimation.

2.3 Visual SLAM combined with depth estimation

Acquiring the camera pose and maintaining globally consistent mapping in dynamic environments pose significant challenges. Numerous methodologies have been developed to address this issue. [31] employed SegNet for semantic segmentation, effectively eliminating feature points within dynamic regions. [32] combined MaskRCNN with ORBSLAM2 [19] to detect moving vehicles, thereby mitigating the impact of dynamic obstacles on static-background pose estimation. [33] leveraged semantic information to aid in epipolar geometry calculation within dynamic environments, overcoming noise interference in motion information between adjacent frames. [34] integrated object detection and depth information for dynamic feature recognition, achieving performance levels comparable to semantic segmentation. Moreover, [34] employed an IMU for motion prediction in feature tracking and motion consistency checking. [35] utilized the interdependence between camera motion and optical flow, optimizing them jointly within a unified learning framework in dynamic environments. [36] optimized static scenes, dynamic object structures, and camera pose simultaneously, facilitating the decoupling and estimation of three-dimensional bounding boxes for dynamic objects within fixed time windows. While existing methodologies effectively handle dynamic feature points and optical flow, enhancing positioning accuracy and robustness in dynamic environments, monocular SLAM invariably suffers from significant scale drift regardless of the approach employed.

2.4 Multi-scale models

Image pyramids are extensively employed in computer vision tasks to enable models to handle multi-resolution and multi-scale information [7, 19, 37]. Image pyramids simulate various object distances, which is crucial for handling scale ambiguity in monocular vision. This aids monocular depth estimation and SLAM in perceiving depth values for objects of different sizes. [38] employed multi-scale input and an attention mechanism to enhance spatial perception in monocular depth estimation. [39,40,41] captured fine-grained images with multi-scale input. [42] mitigated the impact of object size differences on the Convolutional Neural Network (CNN) model through multi-scale input. While a CNN backbone builds multi-scale representations through downsampling, these operate on feature layers produced by complex operations rather than on the original RGB image [43, 44]. By incorporating multi-scale input, Dyna-MSDepth effectively emulates objects at different distances and scales, thereby improving the accuracy of dynamic object depth estimation.

3 Method

This section presents the principles of Dyna-MSDepth. Firstly, Sect. 3.1 introduces the principle of self-supervised monocular depth estimation. Section 3.2 further analyzes the theory of depth ranking, which serves as the foundation for Dyna-MSDepth in estimating depth in dynamic scenes. Section 3.3 proposes a multi-scale depth estimation module that incorporates high-order spatial interaction, leading to improved performance compared to the baseline. Building upon these foundational theories, Sect. 3.4 introduces the specific architecture of Dyna-MSDepth for self-supervised multi-scale monocular depth estimation in dynamic scenes.

3.1 Self-supervised monocular depth estimation

The self-supervised monocular depth estimation approach consists of two components: DepthNet for depth estimation and PoseNet for 6D pose estimation [10]. During the training process, given a pair of monocular images \(({I_x},{I_y})\), DepthNet estimates the dense depth map \(({D_x},{D_y})\), while PoseNet estimates the relative pose \({P_{xy}}\) between the image pair. Subsequently, \(D_y\) is employed to generate the simulated image \(I_x^{\prime }\) of the previous frame using \({P_{xy}}\) projection. By calculating the pixel difference between \(I_x\) and \(I_x^{\prime }\), the self-supervised monocular depth estimation method is trained.
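For illustration, the view-synthesis step can be sketched as an inverse warp that back-projects the target pixels with the predicted depth, transforms them with the relative pose, and samples the source image, following the standard formulation of [10]. The PyTorch-style sketch below is a simplified illustration rather than the exact implementation; the function name, the assumption of known intrinsics K, and the border padding for out-of-view pixels are illustrative choices.

```python
import torch
import torch.nn.functional as F

def inverse_warp(img_src, depth_tgt, T_tgt_to_src, K):
    """Synthesize the target view by sampling the source image.

    img_src:      (B, 3, H, W) source frame
    depth_tgt:    (B, 1, H, W) predicted depth of the target frame
    T_tgt_to_src: (B, 4, 4) relative pose from target to source
    K:            (B, 3, 3) camera intrinsics
    """
    B, _, H, W = depth_tgt.shape
    device = depth_tgt.device

    # Pixel grid in homogeneous coordinates: (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D points in the target camera frame
    cam_pts = torch.inverse(K) @ pix * depth_tgt.view(B, 1, -1)

    # Transform into the source frame and project back to pixel coordinates
    cam_pts_h = torch.cat([cam_pts, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_pts = K @ (T_tgt_to_src @ cam_pts_h)[:, :3]
    uv = src_pts[:, :2] / src_pts[:, 2:].clamp(min=1e-6)

    # Normalize to [-1, 1] and bilinearly sample the source image
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img_src, grid, padding_mode="border", align_corners=True)
```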

Several monocular depth estimation networks lack the ability to produce consistent, dense depth maps, resulting in discontinuous depth values between frames [45, 46]. To address this issue, Dyna-MSDepth incorporates a consistency loss \(L_G\) from [11] to enforce the continuity of depth values across frames, thereby enhancing the stability of downstream tasks like SLAM. For a point p in \(D_y\) that is successfully projected to \(D_x\), the geometric inconsistency of p between the synthetic depth map \(D_x^y\) and the target depth map \(D_x^{\prime }\) can be calculated as:

$$\begin{aligned} {D_{\mathrm{{diff}}}} = \frac{{{{\left\| {D_x^{\prime }(p) - D_x^y(p)} \right\| }_2}}}{{D_x^{\prime }(p) + D_x^y(p)}}, \end{aligned}$$
(1)

where \(D_x^y\) represents the depth map obtained by projecting \(D_y\) using \({P_{xy}}\), and \(D_x^{\prime }\) denotes the depth map produced by interpolating \(D_x\) to align with \(D_x^y\). Then, the geometric consistency loss \(L_G\) is calculated as:

$$\begin{aligned} {L_G} = \frac{1}{{\left| U \right| }}\sum \limits _{p \in U} {{D_{\mathrm{{diff}}}}(p)}, \end{aligned}$$
(2)

where U represents the set of valid projection points. The \(L_G\) penalty is applied during training to ensure depth consistency within each batch, resulting in continuous depth maps for the entire image sequence.

Equation 2 averages the normalized depth inconsistency over all valid projections. When dynamic objects appear in the training frames, the loss \(L_G\) increases rapidly because the multi-view consistency assumption is violated. Hence, Dyna-MSDepth adopts a weighting mask from [13]:

$$\begin{aligned} {M_s} = 1 - {D_{\textrm{diff}}} \end{aligned}$$
(3)

The mask \(M_s\), ranging from 0 to 1, weights the contribution of each pixel to the loss function \(L_G\) during training. A smaller \(M_s\) implies a higher likelihood of dynamic objects in the region, so its contribution to \(L_G\) is reduced accordingly.
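A minimal PyTorch-style sketch of Eqs. (1)-(3), assuming the projected depth \(D_x^y\), the interpolated depth \(D_x^{\prime }\), and a validity mask for successfully projected pixels are already available (variable names are illustrative):

```python
def geometric_consistency(d_x_interp, d_x_from_y, valid_mask):
    """Per-pixel depth inconsistency (Eq. 1), its mean over valid
    projections (Eq. 2), and the weighting mask M_s (Eq. 3).
    All inputs are (B, 1, H, W) tensors."""
    d_diff = (d_x_interp - d_x_from_y).abs() / (d_x_interp + d_x_from_y)
    l_g = (d_diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)
    m_s = (1.0 - d_diff) * valid_mask   # close to 0 in likely-dynamic regions
    return d_diff, l_g, m_s
```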

Additionally, Dyna-MSDepth employs a photometric loss \(L_P\) from [11], which is weighted, to constrain \(I_x\) and \(I_x^{\prime }\):

$$\begin{aligned} {L_P} = \frac{1}{{\left| U \right| }}\sum \limits _{p \in U} \left( \lambda \left\| {{I_x}(p) - I_x^{\prime }(p)} \right\| + (1 - \lambda )\frac{{1 - \mathrm{{SSIM}}_{x{x^{\prime }}}(p)}}{2}\right) , \end{aligned}$$
(4)
$$\begin{aligned} L_P^M = \frac{1}{{\left| U \right| }}\sum \limits _{p \in U} {({M_s}(p){L_P}(p))}, \end{aligned}$$
(5)

Finally, Dyna-MSDepth incorporates an edge-aware smoothing loss \(L_S\) from [13] to effectively regularize the depth map:

$$\begin{aligned} {L_S} = \sum \limits _p {{{({e^{ - \nabla {I_x}(p)}}\nabla {D_x}(p))}^2}}, \end{aligned}$$
(6)

where \(\nabla \) denotes the first derivative with respect to the spatial dimension.

In cases where dynamic objects occupy only a minority of image pixels, the loss function can consist of the geometric consistency loss, the photometric loss, and the edge-aware smoothing loss:

$$\begin{aligned} {L_{\mathrm{{self}}}} = \alpha {L_G} + \beta L_P^M + \gamma {L_S} \end{aligned}$$
(7)

The weights of the three losses are denoted as \(\alpha \), \(\beta \), and \(\gamma \), respectively.
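The losses of Eqs. (4)-(7) can be sketched as follows. The SSIM term is assumed to come from an external `ssim_fn` helper returning a per-pixel similarity map, and the value of \(\lambda \) shown here (0.15) is a common choice rather than one specified in this paper; the weights \(\alpha \), \(\beta \), and \(\gamma \) follow Sect. 4.3.

```python
import torch

def photometric_loss(i_x, i_x_synth, m_s, valid_mask, ssim_fn, lam=0.15):
    """Eqs. (4)-(5): L1 + SSIM photometric term, weighted by the mask M_s."""
    l1 = (i_x - i_x_synth).abs().mean(dim=1, keepdim=True)
    dssim = (1.0 - ssim_fn(i_x, i_x_synth)) / 2.0          # per-pixel SSIM map
    l_p = lam * l1 + (1.0 - lam) * dssim                   # Eq. (4), per pixel
    return (m_s * l_p * valid_mask).sum() / valid_mask.sum().clamp(min=1)  # Eq. (5)

def smoothness_loss(depth, image):
    """Eq. (6): edge-aware first-order smoothness of the depth map."""
    dD_u = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dD_v = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    dI_u = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dI_v = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return ((torch.exp(-dI_u) * dD_u) ** 2).mean() + ((torch.exp(-dI_v) * dD_v) ** 2).mean()

def self_supervised_loss(l_g, l_p_m, l_s, alpha=1.0, beta=0.5, gamma=0.1):
    """Eq. (7): weighted combination of the three self-supervised terms."""
    return alpha * l_g + beta * l_p_m + gamma * l_s
```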

3.2 Dynamic region refinement

Supervised monocular depth estimation is not directly applicable to dynamic scenes, but it can assist unsupervised monocular depth estimation [27]. The supervised network, trained on large-scale datasets, offers advantages in depth ordering, depth value smoothness, and sharp object edges. It particularly aids in training unsupervised monocular depth estimation networks in dynamic scenes, effectively capturing near-far point relationships of dynamic objects [12]. To address dynamic scenes and ensure fair comparisons, this study adopts the approach of [12] for handling dynamic objects during self-supervised training. [12] employed a fully supervised model to generate a depth map as a prior for self-supervised training. It introduces a depth ranking loss \(L_{DR}\) to regulate the proximity of dynamic objects to the static background, and employs a smoothing loss \(L_N\) to promote depth map continuity. Notably, executing the supervised network only once at the training onset minimizes additional training and inference costs.

The fully supervised network predicts the depth map of the current RGB image, which serves as a pseudo-depth ground truth during training. The mask \(M_s\), computed from the geometric inconsistency, delineates the dynamic regions. Restricting the depth estimation of dynamic regions and strengthening the far-near relationships of static regions improve depth prediction for dynamic regions [12]. The model extracts the depth ranking from the pseudo-depth map and constrains the network-predicted depth through the depth ranking loss \(L_{DR}\):

$$\begin{aligned} \eta ({p_0},{p_1}) = \log (1 + \exp ( - l({p_0} - {p_1}))), \end{aligned}$$
(8)
$$\begin{aligned} {L_{DR}} = \frac{1}{{\left| \Phi \right| }}\sum \limits _{p \in \Phi } {\eta (p)}, \end{aligned}$$
(9)

where l represents the ordinal label provided by the depth prior, and \(\Phi \) represents all sampling point pairs.
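A sketch of Eqs. (8)-(9), assuming point pairs and their ordinal labels have already been sampled from the pseudo-depth prior (the sampling strategy is omitted, and a batch size of one is assumed for brevity):

```python
import torch

def depth_ranking_loss(pred_depth, pairs, ordinal_labels):
    """pairs: (M, 2, 2) integer pixel coordinates (row, col) of sampled pairs;
    ordinal_labels: (M,) ordinal labels l in {-1, +1} taken from the prior.
    Returns the mean ranking loss of Eq. (9) for a single image."""
    p0 = pred_depth[0, 0, pairs[:, 0, 0], pairs[:, 0, 1]]
    p1 = pred_depth[0, 0, pairs[:, 1, 0], pairs[:, 1, 1]]
    eta = torch.log1p(torch.exp(-ordinal_labels * (p0 - p1)))   # Eq. (8)
    return eta.mean()
```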

To enhance the smoothness of the estimated depth map, Dyna-MSDepth computes surface normals and compares those of the predicted depth map with those of the corresponding depth prior. This constraint is formulated as follows [12]:

$$\begin{aligned} {L_N} = \frac{1}{N}\sum \limits _{i = 1}^N {\left\| {{n_i} - n_i^{*}} \right\| }, \end{aligned}$$
(10)

where \(n_i\) represents the surface normal obtained from the predicted depth, while \(n_i^{*}\) represents the normal derived from pseudo-depth. N denotes the total number of pixels in the image.
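One common way to obtain the surface normals needed by Eq. (10) is to back-project the depth map with the camera intrinsics and take the cross product of its spatial derivatives. The sketch below follows this approximation; it may differ in detail from the exact formulation of [12].

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth, K):
    """Approximate per-pixel surface normals from a (B, 1, H, W) depth map."""
    B, _, H, W = depth.shape
    device = depth.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1)
    pts = (torch.inverse(K) @ pix) * depth.view(B, 1, -1)       # back-projected points
    pts = pts.view(B, 3, H, W)
    du = pts[:, :, :, 1:] - pts[:, :, :, :-1]                   # derivative along u
    dv = pts[:, :, 1:, :] - pts[:, :, :-1, :]                   # derivative along v
    n = torch.cross(du[:, :, 1:, :], dv[:, :, :, 1:], dim=1)    # (B, 3, H-1, W-1)
    return F.normalize(n, dim=1)

def normal_loss(pred_depth, pseudo_depth, K):
    """Eq. (10): mean difference between predicted and prior normals."""
    n_pred = normals_from_depth(pred_depth, K)
    n_prior = normals_from_depth(pseudo_depth, K)
    return (n_pred - n_prior).norm(dim=1).mean()
```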

In the depth prior generated by supervised monocular depth estimation, object edges are segmented very sharply. To segment dynamic objects well in the self-supervised monocular depth estimation network, Dyna-MSDepth further applies the edge normal loss [12], so that the relative normal angles of edge point pairs in the estimated depth map are consistent with those in the depth prior:

$$\begin{aligned} {L_{EN}} = \frac{1}{N}\sum \limits _{i = 1}^N {\left\| {{n_{Ai}}{n_{Bi}} - n_{Ai}^{*}n_{Bi}^{*}} \right\| }, \end{aligned}$$
(11)

where \(n_{Ai}\) represents the normal of the estimated depth map’s point pairs, while \(n_{Ai}^{*}\) represents the normal of the point pairs provided by the depth prior.

Thus, the total loss of Dyna-MSDepth is formulated as:

$$\begin{aligned} L = \alpha {L_G} + \beta L_P^M + \gamma {L_N} + \varphi {L_{DR}} + \vartheta {L_{EN}}, \end{aligned}$$
(12)

where \(\varphi \) and \(\vartheta \) are the weights of the corresponding loss terms.
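Combining the individual terms, the total loss of Eq. (12) reduces to a weighted sum; the default weights in the sketch below follow Sect. 4.3.

```python
def total_loss(l_g, l_p_m, l_n, l_dr, l_en,
               alpha=1.0, beta=0.5, gamma=0.1, phi=0.1, theta=0.1):
    # Eq. (12); weights follow Sect. 4.3 (alpha=1, beta=0.5, others 0.1)
    return alpha * l_g + beta * l_p_m + gamma * l_n + phi * l_dr + theta * l_en
```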

3.3 Multi-scale DepthNet

The configuration described so far restricts Dyna-MSDepth to fixed-resolution image processing, limiting its scale adaptability. Monocular scenes often require capturing objects at varying distances from the camera’s optical center, making a single scale insufficient [47]. Employing multi-scale input enhances information coverage for objects of different scales. Additionally, single-scale methods often encounter discontinuities when crossing large depth disparities [48]. Leveraging multi-scale information smooths the depth maps, reducing depth value inconsistencies. Furthermore, multi-scale information effectively mitigates image noise [49] and artifacts, enhancing depth estimation accuracy and robustness.

Hence, to enhance the capabilities of DepthNet in Dyna-MSDepth, an additional branch is incorporated to handle RGB images with varying resolutions. When the monocular image is inputted to the network, the original resolution image is fed into the existing DepthNet backbone, while a down-sampled low-resolution image is directed to the new branch. Consequently, Dyna-MSDepth exhibits multi-scale characteristics. Moreover, most monocular depth estimation networks utilize ResNet as their backbone, which, although efficient and lightweight, lacks sufficient interaction among high-order spatial features across different levels. This limitation hampers its potential for improved accuracy and robustness. Consequently, this section aims to devise a novel multi-scale DepthNet, integrating the latest gConv model for high-order spatial feature interaction.
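A minimal sketch of such a two-branch multi-scale encoder is given below; the backbone modules, the half-resolution factor, and the fusion by upsampling and concatenation are illustrative assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEncoder(nn.Module):
    """Two-branch encoder: the full-resolution image goes through the main
    backbone, a 1/2-resolution copy through a lighter branch; the features
    are fused by upsampling and concatenation (fusion scheme is illustrative)."""
    def __init__(self, backbone_full, backbone_low, c_full, c_low, c_out):
        super().__init__()
        self.backbone_full = backbone_full
        self.backbone_low = backbone_low
        self.fuse = nn.Conv2d(c_full + c_low, c_out, kernel_size=1)

    def forward(self, img):
        img_low = F.interpolate(img, scale_factor=0.5, mode="bilinear",
                                align_corners=False)
        feat_full = self.backbone_full(img)
        feat_low = self.backbone_low(img_low)
        feat_low = F.interpolate(feat_low, size=feat_full.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([feat_full, feat_low], dim=1))
```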

Fig. 2

The architecture of Dyna-MSDepth. It consists of three parts: self-supervised monocular depth estimation, dynamic region optimization, and multi-scale input

The significance of gated convolution (gConv) [50] within Dyna-MSDepth lies in its pivotal role in facilitating multi-scale input and high-order spatial interaction. Assuming that \({{\textbf {u }}} \in {{{\mathbb {R}}}^{H \times W \times C}}\) represents the input layer of gConv, the corresponding output feature layer \({\textbf {y }}\) is mathematically expressed as follows:

$$\begin{aligned}{}[{{\textbf {a}}}_0^{H \times W \times C},\ {{\textbf {b}}}_0^{H \times W \times C}] = {\phi _{\mathrm{{in}}}}({{\textbf {u}}}) \in {{{\mathbb {R}}}^{H \times W \times 2C}}, \end{aligned}$$
(13)
$$\begin{aligned} {{{\textbf {a}}}_1} = f({{{\textbf {b}}}_0}) \odot {{{\textbf {a}}}_0} \in {{{\mathbb {R}}}^{H \times W \times C}}, \end{aligned}$$
(14)
$$\begin{aligned} {{\textbf {y}}} = {\phi _{\mathrm{{out}}}}({{{\textbf {a}}}_1}) \in {{{\mathbb {R}}}^{H \times W \times C}}, \end{aligned}$$
(15)

where \(\phi _{in}\) and \(\phi _{out}\) are linear projections, f denotes a depth-wise convolution, and \({\textbf {a}}_0\) and \({\textbf {b}}_0\) are the intermediate features of gConv.

The mathematical description of \({\textbf {a }}_1\) can be further refined as:

$$\begin{aligned} a_1^{(i,c)} = \sum \nolimits _{j \in {\Psi _i}} {\omega _{i \rightarrow j}^c} b_0^{(j,c)}a_0^{(i,c)}, \end{aligned}$$
(16)

where \(\Psi _i\) is a local window centered at i, and \(\omega \) is the weight of depth-wise convolution. By utilizing element-wise multiplication, the interaction between \(a_0^{i}\) and \(b_0^{j}\) at the 1-order spatial level enhances the model’s representation capability.

After realizing the 1-order spatial interaction, Dyna-MSDepth further extends it to high-order spatial interaction [50], so that it can learn stronger features. \(\phi _{in}\) is used to further extract high-dimensional features:

$$\begin{aligned}{}[{{\textbf {a}}}_0^{H \times W \times {C_0}},\ {{\textbf {b}}}_0^{H \times W \times {C_0}},\ \ldots ,\ {{\textbf {b}}}_{n - 1}^{H \times W \times {C_{n - 1}}}] = {\phi _{\mathrm{{in}}}}({{\textbf {u}}}) \in {{{\mathbb {R}}}^{H \times W \times ({C_0} + \sum \nolimits _{0 \le k \le n - 1} {{C_k}})}}. \end{aligned}$$
(17)

At this stage, the concept of 1-order spatial interaction can be expanded recursively to include higher-order spatial interaction, referred to as \({\mathrm{{g}}^n}\mathrm{{Conv}}\):

$$\begin{aligned} {{{\textbf {a}}}_{k + 1}} = {f_k}({{{\textbf {b}}}_k}) \odot {q_k}({{{\textbf {a}}}_k})/\alpha ,\quad k = 0,1, \ldots ,n - 1, \end{aligned}$$
(18)

where \(\alpha \) is a scale factor used to ensure training stability, \(f_k\) denotes the k-th depth-wise convolution layer, and \(q_k\) is utilized for dimension mapping across the different feature layers, facilitating dimension matching:

$$\begin{aligned} {q_k} = \left\{ {\begin{array}{*{20}{l}} {\mathrm{{Identity}},}&{}{k = 0,}\\ {\mathrm{{Linear(}}{C_{k - 1}},{C_k}\mathrm{{),}}}&{}{1 \le k \le n - 1.} \end{array}} \right. \end{aligned}$$
(19)

The recursive process yields a final output that is passed to \(\phi _{out}\) for output mapping, denoted as \({\mathrm{{g}}^n}\mathrm{{Conv}}\). Additionally, to reduce computational overhead, feature dimensionality can be optimized as follows:

$$\begin{aligned} {C_k} = \frac{C}{{{2^{n - k - 1}}}},\mathrm{{ }}0 \le k \le n - 1. \end{aligned}$$
(20)

Indeed, the computation of \({\mathrm{{g}}^n}\mathrm{{Conv}}\) entails a coarse-to-fine strategy. Additionally, the computation of \({\mathrm{{g}}^n}\mathrm{{Conv}}\) does not experience a substantial increase as n becomes larger:

$$\begin{aligned} \mathrm{{FLOPs(}}{\mathrm{{g}}^n}\mathrm{{Conv)}} < H \times W \times C(2 \times {K^2} + 11/3 \times C + 2), \end{aligned}$$
(21)

where K is the convolution kernel size of depth-wise convolution.
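The recursive gated convolution can be sketched in PyTorch as follows, closely following the public formulation of [50]; the 7×7 depth-wise kernel, the default order n = 3, and the scale factor are illustrative choices, and the channel widths follow Eq. (20) (dim must be divisible by \(2^{n-1}\)).

```python
import torch
import torch.nn as nn

class gnConv(nn.Module):
    """Recursive gated convolution (g^n Conv) after [50]; a simplified sketch."""
    def __init__(self, dim, order=3, scale=1.0 / 3.0):
        super().__init__()
        self.order = order
        self.scale = scale                                        # alpha in Eq. (18)
        self.dims = [dim // 2 ** (order - 1 - k) for k in range(order)]  # Eq. (20)
        self.proj_in = nn.Conv2d(dim, self.dims[0] + sum(self.dims), kernel_size=1)
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), kernel_size=7,
                                padding=3, groups=sum(self.dims))  # depth-wise f
        self.pws = nn.ModuleList(
            [nn.Conv2d(self.dims[k], self.dims[k + 1], kernel_size=1)
             for k in range(order - 1)])                           # q_k mappings
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        a, b = torch.split(self.proj_in(x), [self.dims[0], sum(self.dims)], dim=1)
        b = list(torch.split(self.dwconv(b) * self.scale, self.dims, dim=1))
        a = a * b[0]                        # 1-order interaction, Eq. (14)
        for k in range(self.order - 1):     # higher-order interactions, Eq. (18)
            a = self.pws[k](a) * b[k + 1]
        return self.proj_out(a)             # phi_out
```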

The high-order spatial interaction principle of \({\mathrm{{g}}^n}\mathrm{{Conv}}\) exhibits similarity to the attention mechanism in Transformers. The Transformer architecture, which includes the multi-head self-attention mechanism, has demonstrated promising results [51, 52]. However, it suffers from quadratic complexity in terms of the feature layer size [50]. In contrast, the \({\mathrm{{g}}^n}\mathrm{{Conv}}\) model not only achieves advanced high-order spatial interactions but also reduces the computational burden. This reduction is particularly important for successful monocular depth estimation.

3.4 Model architecture

The architecture of Dyna-MSDepth is illustrated in Fig. 2. The input for Dyna-MSDepth is continuous RGB image frames. When feeding the image pair \(({I_x},{I_y})\), three data streams are generated. Firstly, a pre-trained supervised network predicts a depth prior for the current frame. The supervised network establishes a stable near-far relationship between dynamic objects and the static background and provides sharp object edges. It should be noted that the supervised network runs only once during the entire training process, minimizing training costs. Secondly, the image pair \(({I_x},{I_y})\) undergoes depth map extraction through the multi-scale DepthNet, enabling multi-scale, multi-resolution, and high-order spatial interaction feature fusion. Thirdly, PoseNet processes the image pair \(({I_x},{I_y})\) to determine the relative pose between the images. Using the estimated depth maps and relative pose, the loss is computed according to Eq. 12, and gradient backpropagation is applied to complete the training process. After training, Dyna-MSDepth generates smooth, stable, sharp-edged, and consistent depth maps during forward inference. These depth maps are qualitatively and quantitatively evaluated, and then utilized in monocular SLAM to mitigate scale drift.
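Tying the three data streams together, one training iteration can be sketched as below, reusing the helper functions from the previous sketches. The projection bookkeeping is deliberately simplified (the warped depth stands in for both \(D_x^y\) and \(D_x^{\prime }\), the validity mask is approximated, and the edge-normal term is omitted), so this is an outline of the data flow rather than the exact procedure.

```python
import torch

def training_step(i_x, i_y, K, depth_net, pose_net, prior_depth_x,
                  pairs, ordinal_labels, ssim_fn):
    """One illustrative training iteration combining the sketches above."""
    d_x = depth_net(i_x)
    d_y = depth_net(i_y)
    T_xy = pose_net(i_x, i_y)                               # relative 6D pose

    i_x_synth = inverse_warp(i_y, d_x, T_xy, K)             # view synthesis
    d_x_from_y = inverse_warp(d_y, d_x, T_xy, K)            # projected depth (simplified)
    valid = (d_x_from_y > 0).float()                        # crude validity mask

    _, l_g, m_s = geometric_consistency(d_x, d_x_from_y, valid)
    l_p_m = photometric_loss(i_x, i_x_synth, m_s, valid, ssim_fn)
    l_n = normal_loss(d_x, prior_depth_x, K)
    l_dr = depth_ranking_loss(d_x, pairs, ordinal_labels)
    l_en = torch.zeros(())   # edge-normal term of Eq. (11), omitted in this sketch
    return total_loss(l_g, l_p_m, l_n, l_dr, l_en)
```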

4 Experiment

Dyna-MSDepth, proposed in this paper, aims to estimate reliable multi-scale depth maps in dynamic scenes to restore scale consistency in visual SLAM. This section introduces four challenging dynamic datasets (TUM, KITTI, BONN, DDAD) for evaluating the depth estimation performance of Dyna-MSDepth in Sect. 4.1, along with the evaluation metrics in Sect. 4.2 and the implementation details in Sect. 4.3. Dyna-MSDepth is trained on these datasets, and the results in Sect. 4.4 demonstrate its superior performance compared to existing leading-edge methods. Furthermore, the estimated depth maps are directly integrated into the visual SLAM system, showcasing their practical applicability through qualitative and quantitative assessments in Sect. 4.5.

Fig. 3

Qualitative comparison results of Dyna-MSDepth and other SOTA schemes on the KITTI dataset. Top row: the original images from the KITTI dataset. Second row: the monocular depth estimation results of [11]. Third row: the monocular depth estimation results of [12]. Fourth row: the monocular depth estimation results of Dyna-MSDepth. Bottom row: the ground truth depth maps provided by the KITTI dataset

Table 1 Quantitative comparison results of Dyna-MSDepth and other SOTA schemes on the KITTI dataset

4.1 Datasets

KITTI. The KITTI dataset serves as a common benchmark for SLAM, monocular depth estimation, and object detection. Certain sequences in the KITTI dataset pose challenges due to the presence of numerous dynamic objects, leading to potential failures in SLAM feature point tracking and optical flow estimation [7]. Furthermore, monocular depth estimation is adversely affected by the presence of dynamic objects. To mitigate this issue, it becomes necessary to perform monocular depth estimation on the KITTI dataset while accounting for the influence of dynamic objects. In line with previous work [53], 697 images were utilized from the KITTI dataset for testing and the remaining images for training. Image resolution was scaled from 1241\(\times \)376 to 832\(\times \)256.

TUM. Similarly, the TUM dataset serves as a crucial benchmark for indoor SLAM systems. Notably, the dynamic sequences in the TUM dataset contain a significant number of dynamic objects that occupy a substantial portion of the image. Conventional SLAM approaches tend to suffer from tracking losses or drift in such dynamic sequences. Therefore, it is imperative to estimate the depth of monocular images in a manner that accounts for the influence of dynamic objects in the TUM dataset. Additionally, the TUM dataset provides depth ground truths, enabling the evaluation of estimated depth maps. For evaluation purposes, the last two dynamic sequences from the TUM dataset were reserved for testing, while the remaining sequences were used for training. Image resolution was scaled from 640\(\times \)480 to 320\(\times \)256.

BONN. The BONN dataset comprises 26 indoor sequences featuring fast-moving people and other objects. The primary distinction of the BONN dataset from the TUM dataset lies in the speed of object movement. The pre-defined test set was selected for evaluation, while the remaining sequences were employed for training [12]. Image resolution was scaled to 320\(\times \)256 to maintain consistency.

DDAD. DDAD, a comprehensive dataset comprising 200 sequences, presents a significant challenge for monocular depth estimation and SLAM due to the predominant presence of moving vehicles. Unlike the KITTI dataset, the DDAD dataset primarily consists of dynamic scenes. The DDAD dataset follows the standard training set/test set segmentation, including 12,650 training images and 3950 test images, all scaled to a resolution of 640\(\times \)384.

4.2 Evaluation metrics

The evaluation of Dyna-MSDepth encompasses two aspects: monocular depth estimation and SLAM.

For monocular depth estimation, standard evaluation metrics were employed, such as the mean absolute relative error (AbsRel), the root mean squared error (RMS), and the accuracy under threshold \(({\delta _i} < {1.25^i},i = 1,2,3)\). To ensure consistent evaluation, the scale was restored before calculating these metrics. Additionally, this paper adopts an evaluation approach that distinguishes dynamic and static regions based on semantic segmentation [12]. Specifically, a pre-trained semantic segmentation network is used to identify dynamic objects (people and cars) in the four datasets, and the monocular depth estimation metrics are then computed separately for dynamic and static regions.
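For reference, the standard depth metrics can be computed as in the sketch below; median scaling is shown as one common way to restore the unknown scale before evaluation and is an assumption rather than necessarily the alignment used in this paper.

```python
import numpy as np

def depth_metrics(pred, gt, mask, align_scale=True):
    """AbsRel, RMS, and threshold accuracies (delta_i < 1.25^i) over valid pixels."""
    pred, gt = pred[mask], gt[mask]
    if align_scale:
        # Median scaling: one common way to restore the unknown monocular scale
        pred = pred * np.median(gt) / np.median(pred)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rms = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    d1, d2, d3 = [(ratio < 1.25 ** i).mean() for i in (1, 2, 3)]
    return abs_rel, rms, d1, d2, d3
```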

Regarding the evaluation of the SLAM system, this paper focuses on the KITTI dataset, where the primary challenge for monocular SLAM is the scale drift arising in long sequences. Hence, the extent of scale drift reduction achieved by incorporating the depth maps estimated by Dyna-MSDepth is assessed, along with the average trajectory accuracy after scale alignment.

4.3 Implementation details

During the training process, the initial learning rate for the four datasets was set to 1e-4 and multiplied by 0.8 every 10 epochs. The batch size was set to 8 for the outdoor KITTI and DDAD datasets and to 4 for the indoor TUM and BONN datasets. \(\alpha =1\), \(\beta =0.5\), and \(\gamma =\varphi =\vartheta =0.1\). The pre-trained LeReS [54] model was utilized to generate the depth prior, and MSeg [55] was utilized to generate dynamic object masks for evaluation.

4.4 Depth estimation results

Figure 3 presents the qualitative comparison results between Dyna-MSDepth and other cutting-edge techniques on the KITTI dataset. The depth maps generated by Dyna-MSDepth outperform those of the previous methods. Specifically, the depth estimation for small objects in [11] is highly inaccurate, while for low-texture regions in [12], although improved compared to [11], it still falls short of the accuracy achieved by Dyna-MSDepth. Table 1 summarizes the quantitative evaluation, demonstrating the significant performance enhancement of Dyna-MSDepth compared to SC-DepthV3 [12]. Notably, Dyna-MSDepth achieves a 3\(\%\) improvement in accuracy for dynamic regions. The multi-scale input of Dyna-MSDepth accounts for its improved depth estimation of small objects, low-texture regions, and distant scenes.

Fig. 4

Qualitative comparison results of Dyna-MSDepth and other SOTA schemes on the TUM dataset. Top row: the original images from the TUM dataset. Second row: the monocular depth estimation results of [13]. Third row: the monocular depth estimation results of [12]. Fourth row: The monocular depth estimation results of Dyna-MSDepth. Bottom row: the ground truth depth maps provided by the TUM dataset

Table 2 Quantitative comparison results of Dyna-MSDepth and other cutting-edge methods on the TUM dataset

Figure 4 presents the qualitative comparison results of Dyna-MSDepth and other state-of-the-art approaches on the TUM test set. The findings demonstrate that depth estimation without a dynamic object loss function yields poor results for dynamic objects. In particular, the depth values of nearby points are estimated inaccurately, resulting in blurred boundaries and potential SLAM failure. In contrast, SC-DepthV3 [12] exhibits commendable depth estimation for dynamic regions, yet it still misses other close-range dynamic objects (e.g., hands and heads). Additionally, Dyna-MSDepth shows superior performance in capturing low-texture structures at relatively far distances, showcasing the benefits of multi-scale input. The quantitative indicators in Table 2 also verify this conclusion.

Figure 5 illustrates the qualitative comparison between Dyna-MSDepth and other leading-edge methods on the BONN dataset. The findings reveal that, without a dynamic-region loss function, monocular depth estimation for dynamic targets is notably poor, with the depth values of dynamic objects blending into the static background. Additionally, due to the lack of multi-scale characteristics, the depth estimation results for the static background are subpar. In contrast, the current SOTA method, SC-DepthV3 [12], exhibits significant improvement in depth optimization for dynamic regions. Nonetheless, there are still limitations in SC-DepthV3’s depth estimation when dynamic objects enter the scene or when two moving objects overlap, making the extracted object edges less distinct. Conversely, Dyna-MSDepth demonstrates superior performance in handling dynamic objects, and its multi-scale input enables precise capture of intricate details of distant objects such as tables and chairs. Table 3 presents the quantitative comparison results of Dyna-MSDepth on the BONN dataset, indicating its overall superiority over SC-DepthV3, particularly in terms of dynamic region accuracy, which exhibits a 1.2\(\%\) enhancement.

Fig. 5

Qualitative comparison results of Dyna-MSDepth and other SOTA schemes on the BONN dataset. Top row: the original images from the BONN dataset. Second row: the monocular depth estimation results of [13]. Third row: the monocular depth estimation results of [12]. Fourth row: the monocular depth estimation results of Dyna-MSDepth. Bottom row: the ground truth depth maps provided by the BONN dataset

Table 3 Quantitative comparison results of Dyna-MSDepth and other leading-edge methods on the BONN dataset

Figure 6 illustrates the qualitative comparison results of Dyna-MSDepth and other state-of-the-art techniques on the DDAD dataset. The depth map produced by method [11] exhibits poor discrimination between near and far points, resulting in inaccurate depth estimation for dynamic objects. On the other hand, SC-DepthV3 yields sharp edges for dynamic objects but struggles with accurate depth estimation for distant low-texture scenes and objects. In contrast, leveraging multi-scale input, Dyna-MSDepth demonstrates improved depth estimation for small objects in the distance. The comparison results on the DDAD dataset (Table 4) reveal that Dyna-MSDepth, discussed in this study, exhibits slightly lower accuracy in dynamic regions compared to SC-DepthV3, albeit with similar processing methods. However, Dyna-MSDepth’s multiscale architecture notably enhances its ability to capture detailed textures and distant depth, leading to superior overall performance when compared to SC-DepthV3.

4.5 SLAM test results

In outdoor settings like autonomous driving, SLAM systems require prolonged operation, exacerbating scale and trajectory drift issues. To assess Dyna-MSDepth’s efficacy in mitigating these challenges, this study conducts experiments with ORB-SLAM3 on the KITTI dataset in monocular and RGB-D modes, evaluating positioning accuracy and scale drift. Loop detection is disabled in both experiments to better gauge the depth map’s impact on scale restoration.

Fig. 6

Qualitative comparison results of Dyna-MSDepth and other SOTA schemes on the DDAD dataset. Top row: the original images from the DDAD dataset. Second row: the monocular depth estimation results of [11]. Third row: the monocular depth estimation results of [12]. Fourth row: the monocular depth estimation results of Dyna-MSDepth. Bottom row: the ground truth depth maps provided by the DDAD dataset

Table 4 Quantitative comparison results of Dyna-MSDepth and other cutting-edge methods on the DDAD dataset

The KITTI dataset, commonly utilized for SLAM experiments in outdoor environments, contains sequences with a substantial presence of dynamic objects. The depth maps generated by Dyna-MSDepth are directly employed in ORB-SLAM3, followed by trajectory evaluation for each sequence. Figure 7 demonstrates the pronounced scale drift occurring when the monocular camera operates continuously over long distances. With the integration of the depth maps, a substantial alleviation of scale drift is observed. This improvement in scale accuracy is further validated by the data presented in Table 5, highlighting the significant enhancement in positioning precision achieved by incorporating the depth maps generated by Dyna-MSDepth.
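As an illustration of how the estimated depth maps can be fed to ORB-SLAM3 in RGB-D mode, the sketch below stores each predicted depth map as a 16-bit PNG with a fixed depth factor. The factor of 5000 follows the TUM RGB-D convention and is shown only as an example; whatever factor is chosen must match the DepthMapFactor entry in the ORB-SLAM3 configuration file.

```python
import numpy as np
import cv2

def save_depth_for_rgbd_slam(pred_depth_m, out_path, depth_factor=5000.0):
    """Store a predicted metric depth map (in meters) as a 16-bit PNG so that
    it can be paired with the RGB frame and fed to ORB-SLAM3 in RGB-D mode.
    depth_factor must equal DepthMapFactor in the ORB-SLAM3 YAML settings."""
    depth_png = np.clip(pred_depth_m * depth_factor, 0, 65535).astype(np.uint16)
    cv2.imwrite(out_path, depth_png)
```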

5 Conclusion

Monocular SLAM is widely used in visual localization, mapping, 3D reconstruction, and navigation tasks due to its low cost and easy configuration. However, it suffers from scale ambiguity and significant scale drift during long-term running, particularly in dynamic scenes where geometric consistency assumptions are violated.

Fig. 7

Qualitative comparison results show that the depth maps estimated by Dyna-MSDepth are directly used to reduce the scale drift of monocular SLAM. The sequences from left to right are 00, 05, 06, 07, 08, 09

To address these issues, this paper proposes Dyna-MSDepth, a self-supervised monocular depth estimation network that restores scale consistency and provides globally consistent depth maps in dynamic scenes. Dyna-MSDepth employs self-supervised training and a specific loss function to generate dense depth maps with continuous values, ensuring scale consistency for monocular SLAM. A dynamic optimization strategy is introduced to estimate reliable depth maps in the presence of dynamic objects. Furthermore, multi-scale inputs are introduced to enable Dyna-MSDepth to perceive the depth values of objects with different distances and scales.

Qualitative and quantitative evaluations on challenging dynamic datasets (KITTI, TUM, BONN, DDAD) demonstrate that Dyna-MSDepth outperforms existing state-of-the-art methods in monocular depth estimation. Monocular SLAM experiments on the KITTI dataset further confirm the effectiveness of Dyna-MSDepth in enabling accurate mapping and navigation tasks.

The paper concludes with the following findings:

Table 5 Quantitative comparison results (m) of ORB-SLAM3 on the KITTI dataset before and after introducing the depth maps estimated by Dyna-MSDepth
  1. The proposed Dyna-MSDepth can estimate stable, reliable and consistent multi-scale depth maps in dynamic scenes;

  2. Evaluation on four challenging dynamic datasets demonstrates that Dyna-MSDepth outperforms other state-of-the-art methods, as observed through qualitative and quantitative analysis;

  3. The depth maps generated by Dyna-MSDepth can be directly utilized in monocular SLAM without the need for additional complex post-processing, highlighting its practical applicability.