
1 Introduction

Depth is so essential for perceiving, understanding and navigating the 3D world around us that animals have evolved redundant apparatus for perceiving it. Animals are able to infer the relative depth of a perceived scene even with a single eye [41]. Although the loss of dimensionality from projecting a 3D scene onto a 2D plane cannot be completely recovered, an image contains many useful monocular cues, such as relative sizes, texture gradients, linear perspective and contrast differences. Image sequences contain additional motion cues, such as occlusion and motion parallax [3]. These cues impose constraints on the possible depth configurations of the underlying scene. Such cues can be learnt implicitly by computational models in a supervised [8, 36] or self-supervised [10, 12, 59] manner, making it possible to infer relative depth from a single image. During training, self-supervised methods rely on motion cues coming from image sequences. They simultaneously estimate the relative pose between two consecutive frames of a sequence and their individual depths, such that by warping one frame using the depth, relative pose and camera intrinsics, the other frame can be synthesized. The underlying assumption is photometric constancy across neighboring frames of a sequence, so a loss between the warped neighboring image and the true one can back-propagate through the depth and pose networks.

Fig. 1. Variation of the per-frame scale factor \(\frac{\text {median}(D_{\text {true}})}{\text {median}(D_{\text {pred}})}\) within each KITTI validation sequence, shown (left) as a box-plot and (right) over time.

The pixels of a 2D image could have been projected from an infinite number of 3D points, making the inverse problem of recovering the depth dimension from a 2D image ill-posed. While cues constrain the solution depth map to have an underlying structure, the estimated depth can only be relative in nature and not absolute, since any scalar multiple of the estimated depth map would be equally optimal. Furthermore, these self-supervised models do not necessarily learn our standard units of measuring distance, but rather predict depth in their own. Even for humans these units, such as metric, imperial or US customary, vary in usage; after all, these units and the scalar factors relating them are a human construct. Such scale ambiguity is not usually a problem, since one can calculate the scale factor relating the model’s depth to a metric or imperial one as a post-processing step and convert one to the other. Unfortunately, the depth estimates from self-supervised models are not just scale ambiguous but also temporally scale inconsistent, i.e. the scale of one frame’s depth map differs from that of the neighboring frame’s. This variation of the scale factors is shown in Fig. 1 as a box-plot for each KITTI validation sequence and also over time for the five sequences with the largest variation. Here, the scale factor of a frame is the ratio of the metric LIDAR depth to the predicted depth, \(\frac{\text {median}(D_{\text {true}})}{\text {median}(D_{\text {pred}})}\).

Due to this variation, it is not accurate to determine the scale factor analytically once and reuse it later; the scale must instead be calculated for each frame, which is infeasible for applications without access to some kind of ground truth, as in the case of online videos. In fact, most current methods artificially scale their output depth maps per frame, using the LIDAR ground truth, as a post-processing step before calculating the error at test time.

This work hypothesises that the scale consistency problem is due to the lack of proper temporal constraints: the scale is independently ambiguous for each frame, and the training pipeline finds a scale that is locally optimal for that frame, regardless of whether it agrees with the scale of the neighboring frames. While some recent works impose consistency in depth to address this issue, this work instead explores additional ways of imposing temporal constraints in an unsupervised manner, i.e. without the need for additional ground truth supervision.

2 Related Work

This section reviews how the scale ambiguity and scale consistency problems have been tackled in the literature, and also points to losses similar to ours in related tasks. Temporal data such as video exhibits temporal correlations and is unlikely to change abruptly within short intervals. Temporal consistency exploits this information flow and has been utilized for various video applications [2, 22, 28], including supervised depth estimation [55], style transfer [5], video completion [15] and segmentation [56], to capture temporal correlations. Temporal consistency is also the key component of recent self-supervised monocular depth estimation approaches [19, 59], where photometric constancy among consecutive frames is assumed and differences after reprojection are minimized. But, since the depth network in these methods is monocular and sees only one frame at a time, the predicted depth maps are not consistent over time.

2.1 Scale-Disambiguation/Consistency-Enforcement via Supervision

Methods with some kind of ground truth supervision are able to enforce scale consistency and do not suffer from the same scale inconsistency issues. For example, the scale factor in stereo methods comes from the relation of disparity and depth, \(\text {depth}=\frac{\text {focal length} \times \text {baseline}}{\text {disparity}}\), and is constant. Following this, Roussel et al. [34] first pre-train on stereo data and then fine-tune on monocular data, showing that doing so retains the scale learnt from the stereo pre-training. GPS or ground-truth pose data has also been used to disambiguate scale [4, 12] by enforcing the pose network’s output to match the pose coming from sensors.

Depth estimation methods based on hand-crafted features, such as SLAM [31] and Structure from Motion (SfM) [37], have been used as a source of supervision [26, 40] to transfer wide-baseline symmetric depth and sparse long-term depth consistencies to the depth estimation network. In a similar fashion, ideas from Visual Odometry (VO), such as epipolar geometry and bundle adjustment, have been incorporated [54, 58] to independently compute correspondences and triangulate them to produce sparse depth, which acts as additional supervision for the network’s depth prediction.

Via the Plane Assumption. Following Kitt et al. [20]’s scale recovery for VO, many recent works [29, 52] make use of the following assumptions: a) most automotive cameras are rigidly mounted at a fixed position and constant orientation with respect to the road, b) the roll and pitch movements of the vehicle have a negligible effect on its position and orientation, c) most urban streets may be assumed to be approximately planar in the vicinity of the vehicle. These assumptions allow them to use the camera extrinsic parameters, in particular the known camera height, and compare it with the height estimated by fitting a plane to the road, to recover the scale as a post-processing step. [44, 50] also use the camera height for scale but incorporate it within training. These methods rely on the heuristics that a flat road plane is visible in the area of interest and that the camera position and orientation remain constant over time, which are often not realistic.
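To make the camera-height heuristic concrete, the following is a minimal NumPy sketch (our own illustration, not taken from any of the cited works): road pixels are assumed to have been back-projected elsewhere into 3D points in the camera frame using the predicted depth, a plane is fitted to them, and the predicted-to-metric scale is the ratio of the known camera height to the estimated one.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit. points: (N, 3) road points in the camera frame.
    Returns a unit normal n and offset d such that n.x + d = 0 on the plane."""
    centroid = points.mean(axis=0)
    # The normal is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    d = -normal.dot(centroid)
    return normal, d

def scale_from_camera_height(road_points, known_camera_height):
    """Scale factor relating predicted depth to metric depth, assuming the
    camera sits at a known, fixed height above a locally planar road."""
    normal, d = fit_plane(road_points)
    # Distance from the camera centre (origin of the camera frame) to the plane.
    estimated_height = abs(d) / np.linalg.norm(normal)
    return known_camera_height / estimated_height
```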

Methods that use some kind of supervision to disambiguate depth at each frame also inherently enforce temporal consistency: since the supervision is temporally consistent, the predicted depth maps become temporally consistent as a byproduct.

2.2 Self-supervised Temporal Consistency

Recurrent neural networks [32, 46] have been used to implicitly model multi-frame inputs. Feeding multiple frames directly to the depth network causes holes at regions with moving objects [49], which need to be corrected by an additional single-input network. 3D geometric constraints [6, 27, 45] have been proposed that penalize the Euclidean distances between the reconstructed point clouds of two consecutive frames, after transforming one to the other. Similar geometric constraints on the depth maps have been proposed [1, 25, 35] which minimize the inconsistency of the estimated disparity maps of two consecutive frames, after warping one onto the other. Our work lies in this category and differs in that we propose complementary constraints on the pose, which can, in principle, be added and used together with the other temporal constraints.

2.3 Similar Constraints in Literature

Constraints similar to the ones we impose via a loss are found across various computer vision applications.

Forward-Backward Consistency. Li et al. [24] propose a forward-backward pose consistency loss in a stereo setting, where the pose from one frame of the left camera to its neighboring frame should be identical to the pose between the corresponding frame of the right camera and its neighboring frame. Based on Narayanan et al. [39]’s forward-backward optical flow consistency assumption, [30] propose such a loss in their optical flow prediction network. This was adopted in the context of monocular depth and ego-motion [46, 53], where the flow caused by the rigid ego-motion estimate is computed and the forward-backward inconsistency of this rigid flow is minimized. Sheng et al. [38] propose a loss that minimizes the forward-backward inconsistencies in the bi-directional warping fields generated from the rigid ego-motion. [54, 62] train an optical flow network, in addition to the depth and ego-motion networks, for dense per-pixel 2D motion between two consecutive frames and adopt the same forward-backward consistency loss for their optical flow. Li et al. [23] propose a forward-backward loss directly on the pose estimates; our forward-backward loss is similar to theirs.

Cycle Consistency. The forward-backward consistency is an instance of the idea of cycle-consistency which has been applied to a wide range of computer vision tasks [14, 16, 47, 60]. Related to our problem of interest, pose cycle consistency was used in the context of Visual Odometry [18, 61]. Ruhkamp et al. [35] propose a strategy to detect and mask inconsistent regions such as occlusions, in neighboring depth maps via a cycle consistency of the reprojected RGB frames.

3 Method

Given video sequences captured by a camera with known intrinsic parameters K, the objective is to learn a depth network \(f_D: \mathbb {R}^{3\times H\times W} \rightarrow \mathbb {R}^{H\times W}\) that maps an RGB frame at time t, \(I_t\in \mathbb {R}^{3\times H\times W}\), to the depth of its underlying scene, \(D_t\in \mathbb {R}^{H\times W}\), and an ego-motion network \(f_E: \mathbb {R}^{2(3\times H\times W)} \rightarrow \mathfrak {se}(3)\) that takes two consecutive RGB frames \(\{I_t, I_{t+n}\}\), typically with \(n \in \{-1,1\}\), and outputs the 6 Degrees-of-Freedom (DOF) rigid transformation between them. Here, H and W are the image height and width, and \(\textbf{e} \in \mathfrak {so}(3)\), \(\textbf{t} \in \mathbb {R}^3\) and \(\textbf{R} \in \text {SO}(3)\) denote the axis-angle rotation, the translation and the rotation matrix respectively.
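As a concrete illustration of these two interfaces, here is a PyTorch sketch with toy stand-in architectures; the names DepthNet and PoseNet and their layers are ours for illustration only (in practice Monodepth2-style encoder-decoders are used).

```python
import torch
import torch.nn as nn

class DepthNet(nn.Module):
    """f_D: maps an RGB frame (B, 3, H, W) to a positive depth map (B, H, W)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())  # enforce positive depth

    def forward(self, image):
        return self.net(image).squeeze(1)

class PoseNet(nn.Module):
    """f_E: maps an ordered, concatenated frame pair (B, 6, H, W) to a 6-DOF pose,
    returned as an axis-angle vector e and a translation t."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, 6)

    def forward(self, frame_t, frame_tn):
        x = self.conv(torch.cat([frame_t, frame_tn], dim=1)).flatten(1)
        pose = self.fc(x)
        return pose[:, :3], pose[:, 3:]   # axis-angle, translation
```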

The depth and the ego-motion are jointly trained with the key assumption of photometric constancy, i.e. a view reprojected with the camera intrinsics, depth and the camera motion onto its neighboring view, reconstructs the neighboring view. The reprojection of a view onto the neighbor is as follows

$$\begin{aligned}{}[\hat{u}, \hat{v}, 1]^T \sim K\textbf{R}_t^{t+n}D_t^{ij}K^{-1}[u,v,1]^T+K\textbf{t}_t^{t+n}, \end{aligned}$$
(1)

where \([u, v, 1]\) and \([\hat{u}, \hat{v}, 1]\) denote the homogeneous pixel coordinates of the view at time t and of its neighboring view, \(D_t^{ij}\) is the pixel’s corresponding depth, and \(\textbf{R}_t^{t+n}, \textbf{t}_t^{t+n}\) denote the rotation and translation from the view at time t to the neighboring view at time \(t+n\), respectively. Using the reprojection Eq. (1), the view at time t is synthesized from the neighboring view as

$$\begin{aligned} \hat{I}_{t+n\rightarrow t}[u,v] = I_{t+n}\langle [\hat{u}, \hat{v}] \rangle , \end{aligned}$$
(2)

where \(\langle \cdot \rangle \) denotes the sampling operator. A robust photometric loss [11, 57] between the view \(\hat{I}_{t+n\rightarrow t}\) synthesized as in Eq. (2) and the original one \(I_{t}\) is minimized, thereby updating and correcting the depth and ego-motion model weights via backpropagation:

$$\begin{aligned} \mathcal {L}_p(I_{t}, \hat{I}_{t+n\rightarrow t})= \frac{\alpha }{2}(1-\text {SSIM}(I_{t}, \hat{I}_{t+n\rightarrow t})) + (1-\alpha )\Vert I_{t} - \hat{I}_{t+n\rightarrow t}\Vert _1, \end{aligned}$$
(3)

where \(\alpha =0.85\) and SSIM denotes the structural similarity loss [48]. Following Godard et al. [11], an additional edge-aware smoothness term, based on the image gradients \(\delta _x, \delta _y\), regularizes the predicted depth:

$$\begin{aligned} \mathcal {L}_s= \sum _{uv}|\delta _x \bar{D}_t^{uv} |e^{-|\delta _x I_t^{uv}|} + |\delta _y \bar{D}_t^{uv} |e^{-|\delta _y I_t^{uv}|}, \end{aligned}$$
(4)

where \(\bar{D}^{uv}_t\) is the mean-normalized depth at pixel [u, v]. We establish a strong baseline by following the best practices from Monodepth2 [10]: we use the minimum reprojection error \(\min _n \mathcal {L}_p(I_{t}, \hat{I}_{t+n\rightarrow t})\) over both neighboring frames to deal with occlusions, and auto-masking to disregard temporally static pixels. These losses are minimized over all pixels in the training set at four resolution scales. The reader is referred to Monodepth2 [10] for more details.
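The PyTorch sketch below is a simplified re-implementation under our own assumptions (not the authors’ code) of how Eqs. (1)–(3) become a differentiable warp and photometric loss: pixels of view t are back-projected with the predicted depth, rigidly transformed, re-projected with K, and used to sample the neighboring frame, after which the SSIM + L1 loss of Eq. (3) is computed. The smoothness term (Eq. 4), minimum reprojection and auto-masking are omitted.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Local SSIM map via 3x3 average pooling (a common simplification)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1)

def reprojection_grid(depth_t, K, R, t):
    """Sampling grid implementing Eq. (1). depth_t: (B, H, W),
    K, R: (B, 3, 3), t: (B, 3)."""
    B, H, W = depth_t.shape
    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth_t.dtype, device=depth_t.device),
        torch.arange(W, dtype=depth_t.dtype, device=depth_t.device),
        indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).reshape(3, -1)      # (3, HW)
    pix = pix.unsqueeze(0).expand(B, -1, -1)
    cam = (K.inverse() @ pix) * depth_t.reshape(B, 1, -1)             # back-project
    cam = R @ cam + t.unsqueeze(-1)                                   # rigid motion
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                    # perspective divide
    grid_u = 2 * uv[:, 0] / (W - 1) - 1                               # normalize to [-1, 1]
    grid_v = 2 * uv[:, 1] / (H - 1) - 1
    return torch.stack([grid_u, grid_v], dim=-1).reshape(B, H, W, 2)

def photometric_loss(I_t, I_tn, depth_t, K, R, t, alpha=0.85):
    """Synthesize view t from its neighbor (Eq. 2) and compare to it (Eq. 3)."""
    grid = reprojection_grid(depth_t, K, R, t)
    I_warp = F.grid_sample(I_tn, grid, padding_mode="border", align_corners=True)
    l1 = (I_t - I_warp).abs().mean(1, keepdim=True)
    dssim = (1 - ssim(I_t, I_warp)).mean(1, keepdim=True)
    return (alpha / 2) * dssim + (1 - alpha) * l1                     # per-pixel loss map
```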

3.1 Pose Constraints

The depth and the pose networks are tightly coupled. We propose to establish temporal consistency in the predicted depth through the pose network’s estimates of the ego-motion between two input frames. Specifically, we propose three constraints that add no new assumptions but explicitly enforce conditions that are already expected to hold. Let \(T\) and \(P\) be two 6-DOF rigid poses that represent the same transformation. We define the distance metric \(d(T,P): \text {SE}(3) \times \text {SE}(3) \rightarrow \mathbb {R}^+\) between them as

$$\begin{aligned} d (T,P) = d(\textbf{R}_T, \textbf{R}_P) + d(\textbf{t}_T, \textbf{t}_P) = \Big \Vert 1-\left( \frac{\text {tr}(\textbf{R}_T\textbf{R}^T_P) - 1}{2}\right) \Big \Vert _1 + \Vert \textbf{t}_T - \textbf{t}_P \Vert _1 \end{aligned}$$
(5)

where \(\theta = \cos ^{-1} (\frac{\text {tr}(\textbf{R}_T\textbf{R}^T_P) - 1}{2}) \in [0,\pi ]\) is the angle of the relative rotation \(\textbf{R}_T\textbf{R}^T_P\), and \(1-\cos {(\theta )}\) corresponds to the geodesic distance on the 3D manifold of rotation matrices [17, 43]. Throughout the proposed constraints, we use Eq. (5) as the distance metric between SE(3) transformations.
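A small PyTorch sketch of the distance in Eq. (5), assuming batched rotation matrices and translation vectors (conversion from the network’s axis-angle output is omitted):

```python
import torch

def pose_distance(R_T, t_T, R_P, t_P):
    """d(T, P) of Eq. (5): a rotation term based on tr(R_T R_P^T)
    plus an L1 translation term. R_*: (B, 3, 3), t_*: (B, 3)."""
    rel = R_T @ R_P.transpose(-1, -2)                              # relative rotation
    cos_theta = (rel.diagonal(dim1=-2, dim2=-1).sum(-1) - 1) / 2   # trace formula
    return (1 - cos_theta).abs() + (t_T - t_P).abs().sum(-1)
```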

Forward-Backward Pose Consistency. The ego-motion between frames at time t and \(t+n\) should be the inverse of the ego-motion between frames at time \(t+n\) and t, i.e. \(\textbf{T}^t_{t+n} {\mathop {=}\limits ^{!}} {\textbf{T}_t^{t+n}}^{-1}\). This is characterized by minimizing the following loss

$$\begin{aligned} \mathcal {L}_{\textbf{T}_{fb}} = d(\textbf{T}^t_{t+n}, {\textbf{T}_t^{t+n}}^{-1}), \end{aligned}$$
(6)

where \(\textbf{T}_t^{t+n}\) and \(\textbf{T}^t_{t+n}\) are the forward and backward poses estimated by the ego-motion network from the ordered frame pairs \((I_t, I_{t+n})\) and \((I_{t+n}, I_t)\) respectively.
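A sketch of this loss, reusing the pose_distance helper above and the closed-form inverse \([\textbf{R} \mid \textbf{t}]^{-1} = [\textbf{R}^T \mid -\textbf{R}^T\textbf{t}]\):

```python
def forward_backward_loss(R_fwd, t_fwd, R_bwd, t_bwd):
    """Eq. (6): the backward pose should equal the inverse of the forward pose."""
    R_inv = R_fwd.transpose(-1, -2)
    t_inv = -(R_inv @ t_fwd.unsqueeze(-1)).squeeze(-1)
    return pose_distance(R_bwd, t_bwd, R_inv, t_inv).mean()
```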

Identity Pose Consistency. Self-supervised monocular depth estimation operates on the underlying assumption of a moving camera, and previous works address violations of this assumption with masking strategies that filter out pixels with no overall motion, caused either by a static camera or by moving objects [10]. This work, in addition, forces the pose network to explicitly handle static inputs and estimate no relative ego-motion. This is done by feeding the pose network the same frame twice and minimizing any predicted ego-motion \(\textbf{T}_t^t\). The loss is as follows

$$\begin{aligned} \mathcal {L}_{\textbf{T}_{id}} = \Vert \textbf{T}_t^t \Vert _1 = \Vert \textbf{e}_{t}^t \Vert _1 + \Vert \textbf{t}_{t}^t \Vert _1 \end{aligned}$$
(7)
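A sketch of this constraint, assuming a pose network that returns an axis-angle vector and a translation for an ordered frame pair (as in the stand-in PoseNet sketched earlier):

```python
def identity_pose_loss(pose_net, I_t):
    """Eq. (7): feed the same frame twice and penalize any predicted motion."""
    axis_angle, translation = pose_net(I_t, I_t)
    return axis_angle.abs().sum(-1).mean() + translation.abs().sum(-1).mean()
```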
Fig. 2. Illustration of the proposed (a) forward-backward, (b) identity and (c) cycle pose constraints.

Cycle Pose Consistency. The poses also need to be consistent in a cyclic manner, i.e. the pose from \(t-n\) to \(t+n\) composed via t should be the same as the directly estimated ego-motion between them, \(\textbf{T}^{t+n}_{t-n} {\mathop {=}\limits ^{!}} \textbf{T}_t^{t+n}\times \textbf{T}_{t-n}^{t}\), as illustrated in Fig. 2(c). The loss is formulated as follows

$$\begin{aligned} \mathcal {L}_{\textbf{T}_{cyc}} = d(\textbf{T}^{t+n}_{t-n}, \ \textbf{T}_t^{t+n}\times \textbf{T}_{t-n}^{t}) \end{aligned}$$
(8)
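A sketch of the cycle constraint, again reusing pose_distance; composing \([\textbf{R}_a \mid \textbf{t}_a]\) (from \(t-n\) to t) with \([\textbf{R}_b \mid \textbf{t}_b]\) (from t to \(t+n\)) gives \([\textbf{R}_b\textbf{R}_a \mid \textbf{R}_b\textbf{t}_a + \textbf{t}_b]\):

```python
def cycle_pose_loss(R_a, t_a, R_b, t_b, R_direct, t_direct):
    """Eq. (8): the composed pose t-n -> t -> t+n should match the pose
    estimated directly between frames t-n and t+n."""
    R_comp = R_b @ R_a
    t_comp = (R_b @ t_a.unsqueeze(-1)).squeeze(-1) + t_b
    return pose_distance(R_direct, t_direct, R_comp, t_comp).mean()
```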

4 Experiments

To maintain consistency and comparability, the exact experimental setup of the baseline [10] is followed. Following the established protocols by Eigen et al. [7], we train and test on Zhou et al. [59]’s splits of the KITTI [9] raw dataset, with the widely-used pre-processing to remove static frames [59], and cap the depth at 80 m. We evaluate the depth model using the standard metrics from Eigen et al. [7]. We also consider the improved ground truth depth maps [42] for evaluation, which use stereo information and 5 consecutive frames to accumulate LiDAR points and handle moving objects, resulting in high quality depth maps. Our loss is added to the final objective with a weight of 0.1.

4.1 Temporal Scale Consistency of Predicted Depth

The main goal of this work is to minimize scale inconsistencies. We introduce a simple metric to measure the scale consistency across consecutive frames.

Table 1. Depth evaluation with each of our constraints applied individually. The second column shows the coefficient of variation (normalized standard deviation) of the scales as a measure of consistency. The top section uses ground-truth scaling in order to make the depth metrics comparable. In the bottom section, a per-sequence (in contrast to per-frame) median scaling is used. Our constraints reduce the inconsistencies while also slightly improving the depth.

Consistency Metric. We hypothesize that, in practice, it would be possible to calculate the actual scale once and use it for the rest of the sequence. Thus, the actual value of the scale does not matter, only its normalized variation over time. We therefore define the coefficient of variation of the computed scale factors, \( \frac{\sigma (\text {scales})}{\mu (\text {scales})}\), where \(\text {scale}=\frac{\text {median}(D_{\text {true}})}{\text {median}(D_{\text {pred}})}\), as a measure of temporal scale consistency in depth.
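A minimal NumPy sketch of this metric, assuming per-frame ground-truth and predicted depth maps for one sequence:

```python
import numpy as np

def scale_consistency(depths_true, depths_pred):
    """Coefficient of variation of the per-frame scale factors over a sequence."""
    scales = np.array([np.median(d_gt) / np.median(d_pred)
                       for d_gt, d_pred in zip(depths_true, depths_pred)])
    return scales.std() / scales.mean()
```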

Table 1 compares our proposed constraints on the standard depth metrics as well as the proposed consistency metric. Additionally, for each sequence in the KITTI test set, we calculate the scale as the median of all per-frame scales and use it throughout the sequence, resulting in the evaluation shown in Table 1’s bottom two rows. Our constraints show reduction in inconsistencies, while also slightly improving the depth.

4.2 Depth Evaluation

Table 2 compares our best model (with \(\mathcal {L}_{\textbf{T}_{cyc}}\)) with works which tackle the scale ambiguity/scale consistency problem. Works which make use of some kind of ground truth supervision, shown in the top section of Table 2, disambiguate scale as a byproduct and, as a result, are automatically scale consistent. The different types of supervision signals used are listed in the Supervision column. The bottom section lists works that tackle the problem without additional supervision; our work belongs to this category.

Table 2. Comparison of our results with works that focus on temporal consistency, on the test set of KITTI [9]’s Eigen [7] split. Works in the bottom section use monocular (M) supervision, while those in the top section use additional supervision: D for LIDAR depth, h for camera extrinsics (height), v for velocity (GT pose) and SfM/SLAM for depth hints from classical methods. For fair comparison, only the contributions tackling the inconsistency problem are compared against. The best and second-best methods are in bold and underlined respectively.

Table 2 shows that our proposed pose consistency constraint gives improved performance with respect to the baseline [10], although this was not our goal, indicating that temporal consistency is also important for accuracy.

4.3 Ego-motion Evaluation

Following Zhou et al. [59]’s protocols, we also evaluate our ego-motion network. We train on KITTI odometry splits sequences 0–8 and test on sequences 9 and 10. Similar to the related methods, we compute the absolute trajectory error (ATE) averaged over all overlapping 5-frame snippets. Since our pose network only takes 2 frames as input, we aggregate the relative poses to create 5-frame trajectories. Table 3 summarizes our improvements with respect to our baseline [10]. Our proposed loss constrains the solution space, thereby making a better optimum easier to achieve.
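A sketch of this pose aggregation (our own simplification, under the assumed convention that each 4x4 transform expresses the later camera’s pose in the earlier camera’s frame), which turns the network’s 2-frame estimates into 5-frame trajectories before the ATE is computed:

```python
import numpy as np

def accumulate_snippet(relative_poses):
    """Chain four consecutive 4x4 relative transforms into the five camera
    positions of a snippet, expressed in the first frame's coordinates."""
    T = np.eye(4)
    positions = [T[:3, 3].copy()]
    for rel in relative_poses:
        T = T @ rel
        positions.append(T[:3, 3].copy())
    return np.stack(positions)   # (5, 3); align to GT, then compute the ATE
```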

Table 3. Ego-motion estimation results: average absolute trajectory error and standard deviation, in meters, on the KITTI odometry dataset [9]. Trained on Seq. 0-8 and tested on Seq. 9 and 10. Our pose constraints show improvement with respect to the baseline [10].

5 Conclusion and Future Work

Self-supervised monocular depth and ego-motion estimation methods suffer not only from scale ambiguity but also from scale inconsistency. While including some kind of ground truth (GT) supervision both disambiguates scale and enforces scale consistency, it is not always feasible to have access to accurate ground-truth information. We propose ego-motion constraints that do not require any additional GT supervision. Through experiments, we show that our proposed constraints not only decrease the inconsistencies but also improve depth and ego-motion performance compared to the baseline.

Our constraints do not aim to compete with but to complement the ones used in the literature, for example SC-SfMLearner [1]’s depth consistency. We have examined how each individual constraint performs; we leave the effect of their interactions with one another and with the constraints used by previous works for future work. We also do not loosen the static scene assumption: if the image motion is dominated by moving objects, it is indeed difficult to estimate the true ego-motion caused by the camera alone. It would be interesting to generalize these constraints with denser motion models, such as Li et al. [23]’s piece-wise rigid flows considering dynamic objects or dense per-pixel optical flows [13].