1 Introduction

Colorectal cancer has recently been reported as the third most prevalent malignancy and the fourth most common cause of cancer-associated death worldwide [1, 13, 15]. Colonoscopy is an effective technique for the prevention and treatment of colon cancer. Many approaches have been proposed for colorectal polyp detection and diagnosis in colonoscopy images and videos [5, 9, 20, 21]. Since geometric features, e.g., the location, size, and shape of polyps, are critical for colorectal polyp diagnosis, depth estimation from colonoscopy images can greatly aid the derivation of 3D geometric information of the intestinal environment.

Many efforts have been devoted to depth estimation and 3D reconstruction of intestinal environments from colonoscopy videos. Earlier model-based approaches rely on low-level geometric cues. Hong et al. [3] estimate depth from colon fold contours. Zhao et al. [22] combine structure-from-motion (SfM) and shape-from-shading (SfS) techniques for surface reconstruction. These approaches suffer from specular reflections and the low texture of colonoscopy images, resulting in inconsistent and highly sparse depth maps. To enhance surface textures for more robust SfM, Widya et al. [17,18,19] use chromoendoscopy with the surface dyed by indigo carmine. However, chromoendoscopy is not widely used, which limits the applicability of this approach. Besides, these approaches require dense feature extraction and global optimization, which is computationally expensive.

Deep-learning-based approaches have recently achieved remarkable performance in general depth estimation. In contrast to natural scenes, where ground-truth depth can be obtained using depth cameras or LiDARs, acquiring ground-truth depth for colonoscopy videos is arduous. Ma et al. [7] use an SfM approach [12] to generate sparse colonoscopy depth maps as ground truth to train a depth estimation network. However, due to the inherent limitation of SfM in reconstructing textureless and non-Lambertian surfaces, it is challenging to obtain accurate dense depth maps for supervised learning. Assuming temporal consistency between frames in videos, unsupervised depth estimation has also been studied [2, 6, 23]. Liu et al. [6] propose a self-supervised depth estimation method for monocular endoscopic images using a depth consistency check between adjacent frames, with camera poses estimated by SfM. Freedman et al. [2] propose a calibration-free unsupervised method that predicts depth, camera pose, and intrinsics simultaneously. However, for colonoscopy videos with weak illumination in complex environments, these unsupervised approaches face significant challenges posed by frequent occlusions between colon folds and non-Lambertian surfaces.

Many works use synthetic data to provide precise ground-truth depth for network training. Mahmood et al. [8] train a joint convolutional neural network-conditional random field framework on synthetic data and transfer real endoscopy images to a synthetic style using a transformer network. Rau et al. [11] train the image translation network pix2pix [4] with synthetic image-and-depth pairs to directly translate a colonoscopy image into a depth map. To reduce the domain gap between synthetic data and real images, their GAN loss also involves the depth maps predicted from real colonoscopy images, but the \(L_1\) loss is not computed on them since no ground truth is available for real images. By doing so, the generator is expected to learn to predict realistic-looking depth maps from real images. However, without accurate supervision of the details in the predicted depth map, it is non-trivial for the generator to precisely predict depth for unseen textures in real colonoscopy images that deviate from the synthetic data.

In this paper, we not only utilize synthetic data with ground-truth depth to help the network learn fine appearance features for depth estimation, but also exploit the temporal consistency between neighboring frames to make full use of unlabeled real colonoscopy videos for self-supervision. Moreover, we design a masked gradient warping loss to filter out unreliable correspondences caused by occlusions or reflections. A more powerful image translation model [16] is also employed in our framework to enhance the quality of depth estimation. We evaluate our method on the synthetic dataset [11] and on our real colonoscopy videos. The results show that our method achieves more accurate and temporally consistent depth estimation for colonoscopy images.

2 Methodology

Given a single colonoscopy image \(\mathbf {F}\), our goal is to train a deep neural network DepthNet G that directly generates a depth map \(\mathbf {D}\) as \(\mathbf {D}=G(\mathbf {F})\). In order to train the DepthNet G, we leverage both the synthetic data for full supervision and real colonoscopy videos for self-supervision via temporal consistency. The framework of our approach is shown in Fig. 1. First, we adopt a high-resolution image translation model to train DepthNet in an adversarial manner with synthetic data. Second, we introduce self-supervision during the network training by enforcing temporal consistency between the predicted depths of neighboring frames of real colonoscopy videos.

Fig. 1. Overview of our approach. (a) We first train DepthNet as a conditional GAN with synthetic image-and-depth pairs. (b) The DepthNet is then fine-tuned with self-supervision by checking the temporal consistency between neighboring frames.

Fig. 2. Checkerboard artifacts. From a colonoscopy image (a), the original pix2pixHD model produces a depth map with checkerboard artifacts (b). The checkerboard artifact is alleviated by replacing the deconvolution layers in the generator with upsampling and convolution layers (c). Smoother depth is generated with our self-supervised model (d).

2.1 Training Baseline Model with Synthetic Data

We adopt the high-resolution image translation network pix2pixHD [16] as our baseline model to translate a colonoscopy image into a depth map. It consists of a coarse-to-fine generator and a multi-scale discriminator in order to produce high-resolution images. The network is trained in an adversarial manner with a GAN loss and a feature matching loss [16] on the synthetic dataset [11], which contains paired synthetic colonoscopy images and corresponding depth maps. However, the original pix2pixHD model produces results with checkerboard artifacts [10], as Fig. 2(b) shows. To alleviate this effect, we replace the deconvolution layers in the generator with upsampling and convolution layers, similar to [6]; Fig. 2(c) shows that this modification alleviates the checkerboard effect. However, considerable noise remains in the predicted results due to the specular reflections and textures that appear frequently in real colonoscopy images.
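As an illustration of this substitution, the sketch below shows an upsampling-plus-convolution block that can replace a stride-2 transposed convolution in a pix2pixHD-style decoder; the channel counts, normalization, and activation here are illustrative assumptions rather than the exact configuration of our generator.

```python
import torch.nn as nn

class UpsampleConv(nn.Module):
    """Nearest-neighbor upsampling followed by a 3x3 convolution.

    A drop-in alternative to a stride-2 transposed convolution, used here
    to suppress checkerboard artifacts. Hyperparameters are illustrative.
    """
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Example: replacing a decoder layer such as
# nn.ConvTranspose2d(256, 128, kernel_size=3, stride=2, padding=1, output_padding=1)
up_block = UpsampleConv(256, 128)
```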

2.2 Self-supervision with Colonoscopy Videos

Due to the domain gap between synthetic and real colonoscopy images, when the DepthNet trained on synthetic data is applied directly to clinical colonoscopy images, the results tend to be spatially noisy and temporally inconsistent because of specular reflections and complex textures in intestinal environments, as Fig. 2(c) shows. While obtaining ground-truth depth for real colonoscopy images is arduous, the temporal correlation between neighboring frames in colonoscopy videos provides natural constraints on the predicted depths. Therefore, we propose to enforce temporal consistency between the predicted depths of neighboring frames during network training.

Fig. 3. Depth gradient warping module to check the temporal structural consistency of the predicted depth maps of two neighboring frames.

For two neighboring frames \(\mathbf {F}_{i}^{r}\) and \(\mathbf {F}_{j}^{r}\) in a real colonoscopy video, the DepthNet estimates two depth maps \(\mathbf {D}_{i}^{r}\) and \(\mathbf {D}_{j}^{r}\), respectively. A typical way to check the consistency between these two depth maps is to warp one frame to the other according to the camera pose and intrinsics, which are not easy to obtain. To avoid camera calibration, we propose a calibration-free warping module that finds pixel correspondences from optical flows. A pre-trained network, PWC-Net [14], is employed to infer the optical flows. Due to self-occlusions and reflections in the colon, brightness constancy is not guaranteed, so errors in optical flow estimation are inevitable. To filter out these optical flow errors, we estimate optical flows \(\mathbf {O}_{i \rightarrow j}\) and \(\mathbf {O}_{j \rightarrow i}\) in both directions. We then check whether a pixel \(\mathbf {p}\) can be warped back to the same position, from frame i to frame j by \(\mathbf {O}_{i \rightarrow j}\) and then from frame j back to frame i by \(\mathbf {O}_{j \rightarrow i}\). If not, the pixel \(\mathbf {p}\) is excluded from the temporal consistency check. Therefore, we compute a mask \(\mathbf {M}_{i}\) for frame i as

$$\begin{aligned} \mathbf {M}_{i}(\mathbf {p})=\left\{ \begin{array}{l} 0,\quad \left| \mathbf {O}_{i \rightarrow j}(\mathbf {p})+ \mathbf {O}_{j \rightarrow i}(\mathbf {q})\right| > \varepsilon \\ 1, \quad otherwise \end{array}\right. \end{aligned}$$
(1)

where \(\mathbf {q}=\mathbf {p}+\mathbf {O}_{i \rightarrow j}(\mathbf {p})\) is the location in frame \(\mathbf {F}^{r}_j\) corresponding to the pixel \(\mathbf {p}\) in frame \(\mathbf {F}^{r}_i\) according to the estimated optical flow. Note that we use bilinear interpolation of \(\mathbf {O}_{j \rightarrow i}(\mathbf {q})\) for a subpixel \(\mathbf {q}\). \(\varepsilon \) is a threshold for the forward-backward warping distance check. We set \(\varepsilon =1\) in our experiments.
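A minimal PyTorch sketch of this forward-backward consistency check is given below. It assumes the optical flows are dense tensors of shape (B, 2, H, W) in pixel units with channels ordered (x, y); the helper name `forward_backward_mask` is introduced here for illustration only.

```python
import torch
import torch.nn.functional as F

def forward_backward_mask(flow_ij, flow_ji, eps=1.0):
    """Validity mask M_i of Eq. (1) from forward and backward optical flows.

    flow_ij, flow_ji: flows O_{i->j} and O_{j->i}, shape (B, 2, H, W),
    in pixels, channels ordered (x, y). A pixel p is kept (mask = 1) only
    if |O_{i->j}(p) + O_{j->i}(q)| <= eps with q = p + O_{i->j}(p).
    """
    b, _, h, w = flow_ij.shape
    device = flow_ij.device
    # Pixel grid p, stored as (B, 2, H, W) with channel order (x, y)
    ys = torch.arange(h, dtype=torch.float32, device=device).view(1, h, 1).expand(b, h, w)
    xs = torch.arange(w, dtype=torch.float32, device=device).view(1, 1, w).expand(b, h, w)
    grid = torch.stack((xs, ys), dim=1)

    # q = p + O_{i->j}(p), normalized to [-1, 1] for grid_sample
    q = grid + flow_ij
    q_norm = torch.stack((2.0 * q[:, 0] / (w - 1) - 1.0,
                          2.0 * q[:, 1] / (h - 1) - 1.0), dim=-1)  # (B, H, W, 2)

    # Bilinear interpolation of O_{j->i} at the sub-pixel locations q
    flow_ji_at_q = F.grid_sample(flow_ji, q_norm, mode='bilinear', align_corners=True)

    # Forward-backward warping distance check
    dist = torch.norm(flow_ij + flow_ji_at_q, dim=1)  # (B, H, W)
    return (dist <= eps).float()
```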

However, the camera moves between two neighboring frames, so the absolute depth values of corresponding pixels in the two frames are generally not equal. Instead of comparing the depth values directly, we encourage structural consistency between the two depth maps by comparing their gradients through the depth gradient warping module. As Fig. 3 shows, we compute the gradients \((\mathbf {G}_{i}^{x}, \mathbf {G}_{i}^{y})\) and \((\mathbf {G}_{j}^{x}, \mathbf {G}_{j}^{y})\) of the two predicted depth maps \(\mathbf {D}_{i}^{r}\) and \(\mathbf {D}_{j}^{r}\) in the x and y directions. We then check the consistency between the depth gradients of the two neighboring frames with the mask \(\mathbf {M}_i\) to calculate the masked gradient warping loss for self-supervision:

$$\begin{aligned} L_{MGW}=\frac{1}{|\mathbf {M}_{i}|} \sum _{\mathbf {p} \in \mathbf {F}_{i}^{r}} \mathbf {M}_{i}(\mathbf {p}) \Big (\left| \mathbf {G}_{i}^{x}(\mathbf {p})-\widetilde{\mathbf {G}}_{i}^{x}(\mathbf {p})\right| + \left| \mathbf {G}_{i}^{y}(\mathbf {p})-\widetilde{\mathbf {G}}_{i}^{y}(\mathbf {p})\right| \Big ), \end{aligned}$$
(2)

where \(\widetilde{\mathbf {G}}_{i}^{x}, \widetilde{\mathbf {G}}_{i}^{y}\) are the gradient maps warped from \(\mathbf {G}_{j}^{x}, \mathbf {G}_{j}^{y}\) according to the estimated optical flow \(\mathbf {O}_{j \rightarrow i}\) by bilinear interpolation.
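Below is a sketch of how \(L_{MGW}\) can be computed in PyTorch, reusing the imports and the `forward_backward_mask` helper from the previous snippet. It uses simple forward differences for the depth gradients and fetches the frame-j gradients at the flow-displaced locations \(\mathbf {q}\) by bilinear sampling; these are illustrative implementation choices, not necessarily the exact ones in our training code.

```python
def masked_gradient_warping_loss(depth_i, depth_j, flow_ij, flow_ji, eps=1.0):
    """Sketch of the masked gradient warping loss L_MGW in Eq. (2).

    depth_i, depth_j: predicted depth maps D_i^r, D_j^r, shape (B, 1, H, W).
    The gradients of D_j are bilinearly sampled at q = p + O_{i->j}(p) and
    compared with the gradients of D_i on pixels accepted by the mask M_i.
    """
    def spatial_gradients(d):
        # Forward differences in x and y, zero-padded at the border
        gx = F.pad(d[..., :, 1:] - d[..., :, :-1], (0, 1, 0, 0))
        gy = F.pad(d[..., 1:, :] - d[..., :-1, :], (0, 0, 0, 1))
        return gx, gy

    b, _, h, w = depth_i.shape
    device = depth_i.device
    gx_i, gy_i = spatial_gradients(depth_i)
    gx_j, gy_j = spatial_gradients(depth_j)

    # Sub-pixel correspondences q = p + O_{i->j}(p), normalized for grid_sample
    ys = torch.arange(h, dtype=torch.float32, device=device).view(1, 1, h, 1)
    xs = torch.arange(w, dtype=torch.float32, device=device).view(1, 1, 1, w)
    qx = (xs + flow_ij[:, 0:1]).squeeze(1)
    qy = (ys + flow_ij[:, 1:2]).squeeze(1)
    q_norm = torch.stack((2.0 * qx / (w - 1) - 1.0,
                          2.0 * qy / (h - 1) - 1.0), dim=-1)  # (B, H, W, 2)

    # Warp the frame-j gradient maps into frame i
    warped = F.grid_sample(torch.cat((gx_j, gy_j), dim=1), q_norm,
                           mode='bilinear', align_corners=True)
    gx_j_w, gy_j_w = warped[:, 0:1], warped[:, 1:2]

    # Masked L1 difference of the depth gradients, Eq. (2)
    mask = forward_backward_mask(flow_ij, flow_ji, eps).unsqueeze(1)
    diff = (gx_i - gx_j_w).abs() + (gy_i - gy_j_w).abs()
    return (mask * diff).sum() / mask.sum().clamp(min=1.0)
```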

Our full objective combines self-supervision via the masked gradient warping loss \(L_{MGW}\) with supervision via the GAN loss \(L_{GAN}\) and the feature matching loss \(L_{FM}\), where \(\alpha \) and \(\gamma \) balance the three loss terms:

$$\begin{aligned} L=\alpha L_{MGW}+\gamma L_{FM}+L_{GAN}. \end{aligned}$$
(3)

3 Experiments

3.1 Dataset and Implementation Details

Both synthetic and real colonoscopy data are used for training and evaluation. We use the UCL synthetic dataset published by Rau et al. [11], which consists of 16,016 pairs of synthetic endoscopic images and corresponding depth maps. Following their split strategy, the dataset is divided randomly into training, validation, and test sets in a 6:1:3 ratio. We also collect 57 clinical colonoscopy videos from different patients. In the training stage, we use neighboring frames from each video at different intervals. Trading off overlap against the interval between frame pairs, we choose four intervals: 1, 4, 8, and 16 frames. The final real colonoscopy dataset contains 6,352 training pairs and 4,217 test pairs.

Both the synthetic and real images are resized to 512 \(\times \) 512. We train our network in two steps. In the first step, we train the model on the synthetic data only. In the second step, we fine-tune the model with self-supervision on real colonoscopy frames. The batch size is set to 8 for synthetic images in the first step and 4 for real images in the second step. We employ the Adam optimizer with \(\beta _1=0.5\) and \(\beta _2=0.999\). The learning rate starts at \(5\times 10^{-5}\) and decays linearly. We update the generator every iteration and the discriminator every 5 iterations. The framework is implemented in PyTorch 1.4 and trained on 4 Nvidia Titan XP GPUs. The first-step training takes 70 epochs, and the second-step fine-tuning with real data is added during the last 10 epochs. The weight of the masked gradient warping loss is initially \(\alpha =5\) and increases linearly by 2 during the second step. The weight of the feature matching loss is \(\gamma =2\).
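The optimizer configuration and alternating update schedule described above can be sketched as follows; the tiny convolutional layers and random tensors are mere stand-ins for the pix2pixHD generator, the multi-scale discriminator, and the data pipeline, and the actual loss terms are elided.

```python
import torch
import torch.nn as nn

netG = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # stand-in for the generator
netD = nn.Conv2d(4, 1, kernel_size=3, padding=1)   # stand-in for the discriminator
optG = torch.optim.Adam(netG.parameters(), lr=5e-5, betas=(0.5, 0.999))
optD = torch.optim.Adam(netD.parameters(), lr=5e-5, betas=(0.5, 0.999))

for step in range(100):                             # dummy number of iterations
    # Generator updated every iteration (L_GAN + gamma*L_FM, plus alpha*L_MGW in step two)
    lossG = netG(torch.randn(1, 3, 64, 64)).mean()  # placeholder loss on a dummy batch
    optG.zero_grad(); lossG.backward(); optG.step()

    # Discriminator updated every 5 iterations
    if step % 5 == 0:
        lossD = netD(torch.randn(1, 4, 64, 64)).mean()  # placeholder loss
        optD.zero_grad(); lossD.backward(); optD.step()
```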

3.2 Quantitative Evaluation

To quantitatively evaluate the performance of our method on depth estimation for colonoscopy images, we compare it with previous approaches on the UCL synthetic dataset [11]. We adopt the same three metrics: the absolute \(L_1\) distance, the relative error, and the root-mean-square error (RMSE) between the ground truth and the prediction. The results are reported in Table 1.
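For reference, the three metrics can be computed as in the sketch below, using common definitions (mean absolute error, mean relative error \(|d-\hat{d}|/d\), and RMSE); the exact formulas used by the benchmark may differ in detail.

```python
import torch

def depth_metrics(pred, gt, eps=1e-6):
    """Absolute L1 distance, relative error, and RMSE between a predicted
    depth map and the ground truth (common definitions; a sketch)."""
    l1 = (pred - gt).abs().mean()                          # absolute L1 distance
    rel = ((pred - gt).abs() / gt.clamp(min=eps)).mean()   # relative error
    rmse = torch.sqrt(((pred - gt) ** 2).mean())           # root-mean-square error
    return l1.item(), rel.item(), rmse.item()
```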

Table 1. Quantitative evaluation on the UCL synthetic dataset (* in cm, ** in %).

Our baseline model is trained only with synthetic data; the results show that a stronger conditional GAN model (pix2pixHD instead of pix2pix) brings a large performance improvement. When evaluated on synthetic data, our model fine-tuned with real colonoscopy videos does not yield further improvement. This is reasonable because the self-supervision between neighboring frames leverages temporal consistency to exploit more depth data from real colonoscopy videos, but it does not bring additional information for the synthetic data.

Although the self-supervision with temporal consistency does not improve the mean accuracy, it significantly improves the temporal consistency of the estimated depths on both the synthetic and real colonoscopy data. We quantify the temporal consistency by the masked gradient warping loss \(L_{MGW}\), which reflects the structural consistency between the estimated depth maps of two neighboring frames. Table 2 shows that our method reduces the masked gradient warping loss on both the synthetic and real data.

Table 2. Masked gradient warping loss on synthetic and real colonoscopy datasets.

3.3 Qualitative Evaluation on Real Data

Without ground-truth depths for quantitative comparison on real colonoscopy data, we evaluate our method qualitatively by comparing its depth predictions with those of other methods. First, we compare our method with Rau et al. [11] and show some examples in Fig. 4. In the first three examples, the results of Rau et al. exhibit a wrongly predicted lumen location, missed polyps, and misinterpreted lumen geometry, respectively. In the last three examples, our method generates more accurate predictions, indicating that our model better captures geometric structure details.

We also evaluate our model with regard to the temporal consistency of the depth estimation. As shown in Fig. 5, without supervision by temporal consistency, the baseline model tends to predict discontinuous depths on the polyp surface due to the specular reflections in the colonoscopy frames. These depth noises also lead to discontinuities between neighboring frames. In comparison, the depths predicted by our fine-tuned model are more spatially smooth and temporally consistent, being less affected by specular reflections and textures.

Fig. 4. Comparison of our method with Rau et al. [11]. The red ellipses highlight inaccurate depth predictions such as wrong locations of the lumen, a missed polyp, and misinterpreted geometry of the lumen.

Fig. 5. Depth estimation for adjacent frames in a real colonoscopy video. Compared with the results generated by the baseline model, our model produces more consistent results, avoiding the noise caused by specular reflections and textures.

4 Conclusion

We propose a novel depth estimation approach for colonoscopy images that makes full use of both synthetic and real data. Treating depth estimation as an image translation problem, we employ a conditional generative network as the backbone model. While the synthetic dataset of image-and-depth pairs provides precise supervision for the depth estimation network, we also exploit unlabeled real colonoscopy videos for self-supervision. We design a masked gradient warping loss to enforce the temporal consistency of the estimated depth maps of neighboring frames during network training. The experimental results demonstrate that our method produces more accurate and temporally consistent depth estimation for both synthetic and real colonoscopy videos. The robust depth estimation will improve the accuracy of many downstream medical analysis tasks, such as polyp diagnosis and 3D reconstruction, and will assist colonoscopists in polyp localization and removal in the future.