
1 Introduction

Colorectal cancer is the third most common cancer in men and the second in women worldwide [6]. Colonoscopy is an effective method of detecting and removing pre-malignant polyps.

There is strong evidence that polyps and adenomas of all kinds are missed at colonoscopy (pooled miss-rate of 22% across multiple studies [8]). An important cause is that the colonic mucosal surface is not entirely surveyed [5]. However, it is very difficult to detect missed colonic surface from the video alone, let alone quantify its extent, because at any given time one sees only a tiny fraction of the colon rather than a more global view. Our solution is a system that visualizes missing colon surface area by reconstructing the streaming video into a fully interactive, dense, textured 3D surface in which unseen regions appear as holes (Fig. 1). This should be done in real time so that the endoscopist can be alerted to the unseen surface promptly and remedy the situation.

Fig. 1. 3D reconstruction for visualization of missing colonic surface (highlighted in black in the last image, 25% of the surface): small colon pouches that are occluded by ridges.

Hong et al. [4] used haustral geometry to interpolate the virtual colon surface so as to find missing regions. However, their work only provided single-frame reconstruction and haustral occlusion detection (without fusion), which is inadequate to determine what has been missed over the course of the procedure. Also, no inter-frame odometry is used, which could otherwise boost reconstruction accuracy. Armin et al. [1] produced a 2D visibility map, which is less intuitive than a dense 3D reconstruction. Zhao et al. [15] used Shape from Motion and Shading for dense endoscopy reconstruction, but their method is not real time.

The SLAM (simultaneous localization and mapping) [2, 3, 7] and Structure-from-Motion (SfM) [9] methods take a video as input and generate both 3D point positions and a camera trajectory. However, besides the fact that most of them do not generate dense reconstructions, they work poorly on colonoscopy images for the following reasons: (1) colon images are very low-textured, which is a disadvantage for feature-point-based methods, e.g., ORBSLAM [7]; (2) photometric variations (caused by the moving light source, the moist surface and occlusions) and geometric distortions make tracking (predicting camera pose and 3D point positions for each frame) too difficult; (3) lack of translational motion and poor tracking lead to severe camera/scale drift (Fig. 2) and noisy 3D triangulation.

Fig. 2. Left: sparse point cloud of a chunk of colonoscopy video produced by a standard SLAM pipeline (DSO) [2]; right: sparse point cloud produced by our method (intermediate result). The cross sections are approximated by yellow ellipses. The diameters in the DSO result decrease dramatically (scale drift), which is unrealistic. Our result has a much more consistent scale thanks to the depth maps predicted by the RNN. (Color figure online)

Convolutional neural networks (CNNs) have been used for SLAM tasks and for predicting dense depth maps [12, 14, 16]. However, these end-to-end networks are subject to accumulated camera drift because, unlike standard SLAM systems, they apply no optimization during prediction. In contrast, other works use a CNN to improve a standard SLAM system [11, 13]. CNN-SLAM [11] incorporated CNN depth prediction into the LSD-SLAM [3] pipeline to provide robust depth initialization; the dense depth maps are finally fused into a global mesh. Yang et al. [13] used CNN-predicted depth (trained on stereo image pairs) to solve the scale-drift problem in Direct Sparse Odometry (DSO) [2]. However, neither stereo images nor groundtruth depth are available for colonoscopy, and training a CNN on colonoscopy images is difficult due to the aforementioned challenges.

In this paper, we present a deep-learning-driven colonoscopic SLAM system. We develop a recurrent neural network (RNN) that predicts both depth and camera poses and combine it in a novel fashion with a SLAM pipeline to improve the stability and reduce the drift of successive frames' reconstructions. The RNN training addresses the difficulties of reconstructing from colonoscopy images. The SLAM pipeline optimizes the depth and camera poses provided by the RNN. Based on these optimized camera poses, the depth maps of the keyframes are fused into a textured global mesh using a nonvolumetric method. Our method produces a high-quality camera trajectory and colon reconstruction, which can be used to visualize missed regions during colonoscopy. The whole system runs in real time.

2 Methodology

2.1 Full Pipeline

The full pipeline (Fig. 3) includes the following steps: (1) Deep-learning-driven tracking: predicting a frame-wise depth map and a tentative camera pose, which are used to initialize the photoconsistency-based tracking; (2) Keyframe selection: upon enough camera motion, creating a new keyframe as the new tracking reference and updating the neural network; (3) Local windowed optimization: the camera poses and the sparsely sampled points' depth values of the latest N (e.g., 7) keyframes are jointly optimized; (4) Marginalization: the oldest keyframe in the window is finalized, i.e., marginalized from the optimization system; (5) Fusion: using the optimized camera pose, the image and the depth map of the marginalized keyframe are fused with the existing surface. We detail item 1 in Sect. 2.2, items 2–4 in Sect. 2.3 and item 5 in Sect. 2.4. A minimal sketch of this per-frame control flow is given below.
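
For concreteness, the following Python pseudocode sketches steps (1)-(5); all component names (rnn_dp, tracker, window, fusion) and their methods are hypothetical placeholders, not the actual implementation.

# Illustrative sketch only; names are placeholders, not the real system.
def process_frame(frame, rnn_dp, tracker, window, fusion):
    # (1) deep-learning-driven tracking: the network gives a dense depth map
    #     and a tentative pose used to initialize photoconsistency-based tracking
    depth, pose_init = rnn_dp.predict(frame)
    pose = tracker.refine(frame, depth, pose_init)

    # (2) keyframe selection: upon enough motion / visual change
    if tracker.needs_new_keyframe(frame, pose):
        window.insert_keyframe(frame, depth, pose)
        rnn_dp.update_hidden_state()      # hidden state advances only at keyframes

        # (3) local windowed optimization over the latest N (e.g., 7) keyframes
        window.optimize_poses_and_depths()

        # (4) marginalization: finalize and drop the oldest keyframe in the window
        oldest = window.marginalize_oldest()

        # (5) fusion: fuse the finalized keyframe into the global textured mesh
        fusion.integrate(oldest.image, oldest.depth, oldest.pose)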

Fig. 3. Flow chart of the presented deep-learning-driven colonoscopic SLAM system

2.2 Deep-Learning-Driven Tracking

Our deep-learning-driven tracking is developed upon RNN-DP (a recurrent neural network for depth and pose estimation [12]) that predicts a depth map and a camera pose for each image in the video. However, it cannot be directly trained on colonoscopy videos because there is no groundtruth depth available. In addition, the pose estimation network in RNN-DP is trained based on image reprojection error, which is severely affected by the specular points and occlusions in colonoscopy videos. Therefore, in this section we present several new strategies that allow RNN-DP to be successfully trained on colonoscopy videos.

To address the lack of groundtruth depth, we use SfM [9] to produce a sparse depth map for each individual colonoscopy video frame; these sparse depth maps then serve as groundtruth for RNN-DP training. We collected 60 colonoscopy videos, each containing about 20K frames, and grouped every 200 consecutive frames into a subsequence overlapping the previous subsequence by 100 frames, yielding about 12K subsequences in total. We then ran SfM [9] on all the subsequences to generate sparse depth maps for each frame. Following the training pipeline in RNN-DP [12], these sparse depth maps are used as groundtruth for training.
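
As a small illustration (a sketch only; the function name and frame-index bookkeeping are ours), the grouping into overlapping subsequences can be written as:

def make_subsequences(num_frames, length=200, stride=100):
    """Split a video of num_frames into subsequences of `length` frames,
    each shifted by `stride` frames (i.e., 100-frame overlap)."""
    starts = range(0, num_frames - length + 1, stride)
    return [list(range(s, s + length)) for s in starts]

# A ~20K-frame video yields ~199 subsequences, so 60 videos give ~12K in total.
print(len(make_subsequences(20000)))  # 199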

To avoid errors from specularity (saturation), we compute a specularity mask \(M_{spec}^t\) for each frame based on an intensity threshold. Image reprojection errors at saturated regions are explicitly masked out by \(M_{spec}^t\) during training.
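
A minimal sketch of such a mask, assuming intensities normalized to [0, 1] and an illustrative threshold value:

import numpy as np

def specularity_mask(intensity, thresh=0.95):
    """Sketch of M_spec: pixels whose intensity exceeds a saturation
    threshold are masked out (0), all others kept (1). The threshold
    value here is an assumption, not the paper's exact setting."""
    return (intensity < thresh).astype(np.float32)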

Colonoscopy images also contain severe occlusions by haustral ridges, so a point in one image may not have any matching point in other images. The original RNN-DP did not handle occlusion explicitly. In order to properly train it on colonoscopy video, we compute an occlusion mask \(M_{occ}^t\) to explicitly mask out image reprojection error at occluded regions. The occlusion mask is determined by a forward-backward geometric consistency check, which was introduced in [14].
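
A hedged sketch of one possible form of such a check (the exact formulation and tolerance used in [14] may differ):

import numpy as np

def occlusion_mask(depth_t, depth_s_warped_to_t, rel_tol=0.05):
    """Sketch of a geometric (forward-backward) consistency check: the depth
    predicted for frame t is compared against the depth of a neighboring
    frame warped into frame t; pixels where the two disagree by more than a
    relative tolerance are treated as occluded and masked out. The tolerance
    is an assumed value."""
    diff = np.abs(depth_t - depth_s_warped_to_t)
    visible = diff <= rel_tol * np.minimum(depth_t, depth_s_warped_to_t)
    return visible.astype(np.float32)  # 1 = visible, 0 = occluded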

Our improved RNN-DP outputs frame-wise depth maps and tentative camera poses (relative to the previous keyframe). They are used to initialize the photoconsistency-based tracking [2] that refines the camera pose.

2.3 Keyframe Management and Optimization

In this subsection, we will briefly review how a vanilla SLAM pipeline (DSO) works and then introduce how RNN-DP interacts with the system.

Besides (deep-learning-driven) tracking, the other three main modules of the SLAM system are keyframe selection, local windowed optimization and marginalization. The SLAM system keeps a history of all keyframes. The latest keyframe is used as the tracking reference for the incoming frames. In the keyframe selection module, if the relative camera motion or the change of visual content (measured by photoconsistency) is large enough, the new frame will be inserted into the keyframe set. It will then be used as a new tracking reference.
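
A minimal sketch of the keyframe decision, with placeholder threshold values:

def need_new_keyframe(rel_motion, photo_change, motion_thresh=0.1, photo_thresh=0.2):
    """Sketch: a frame becomes a new keyframe (and the new tracking reference)
    when either the relative camera motion or the photoconsistency-based change
    of visual content w.r.t. the current reference is large enough.
    The thresholds are placeholders, not the system's actual values."""
    return rel_motion > motion_thresh or photo_change > photo_thresh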

When a keyframe is inserted, the local windowed optimization module is triggered. The local window contains the latest 7 keyframes. A total of 2000 active 2D points are sampled from these keyframes, preferring high-gradient regions. Each active point is hosted by exactly one keyframe but is projected into the other keyframes to compute a photometric error. By minimizing the total photometric loss, the camera poses (\(7\times 6\) parameters) and the depth values of the sampled points (2000 parameters) are jointly optimized. In addition, to tolerate the global brightness change of each keyframe, two lighting parameters per frame are added to model an affine transform of brightness. The purpose of the sampling is to keep the joint optimization sparse and therefore efficient.
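
To make the per-point term concrete, the following sketch shows one photometric residual with the affine brightness correction; all names and the nearest-neighbor lookup are illustrative simplifications, not DSO's actual implementation:

import numpy as np

def photometric_residual(I_host, I_target, u_host, depth, K, T_target_host, a, b):
    """One active-point term: a point at pixel u_host = (x, y) with depth
    `depth` in its host keyframe is reprojected into a target keyframe, and
    the residual is the brightness difference after the per-frame affine
    lighting correction (a, b). Summing squared residuals over all points
    and keyframes gives the loss jointly minimized over poses, depths and
    lighting parameters."""
    # back-project the pixel to a 3D point in the host camera frame
    p_host = depth * (np.linalg.inv(K) @ np.array([u_host[0], u_host[1], 1.0]))
    # transform into the target camera frame and project to pixel coordinates
    p_target = T_target_host[:3, :3] @ p_host + T_target_host[:3, 3]
    u = K @ (p_target / p_target[2])
    x, y = int(round(u[0])), int(round(u[1]))   # nearest-neighbor lookup
    host_intensity = I_host[int(u_host[1]), int(u_host[0])]
    return I_target[y, x] - (a * host_intensity + b)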

After optimization, the oldest keyframe is excluded from the optimization system by marginalization based on the Schur complement [2]. The finalized reconstructed keyframe is to be fused into the global mesh.

The SLAM system is improved by our RNN-DP network. In the keyframe selection module, when a new keyframe is established, the original DSO sets the depth map of this keyframe from the dilated projections of existing active points, and this depth map is then used for tracking subsequent frames. The resulting depth map is sparse, noisy and subject to scale drift. In our method we instead set the keyframe's depth map from the depth prediction of the network. Our depth maps are dense, more accurate and scale-consistent. As a result, the SLAM system becomes easier to bootstrap, which is known to be a common problem for SLAM. Conversely, the SLAM system improves the raw RNN-DP predictions through optimization, which is very important for eliminating the accumulated camera drift of RNN-DP. In summary, this is a win-win strategy.

Our RNN-DP network is integrated into the SLAM system. Its execution is directed by the keyframe decisions made by the system. After tracking, the hidden states of the RNN-DP remain at the stage of the latest keyframe. They are updated only when a new keyframe is inserted.

2.4 Fusion into a Chunk

The independent depth maps predicted by the RNN-DP need to be fused into a global mesh. We use a point-based (nonvolumetric) method called SurfelMeshing [10]. It takes an RGB + depth + camera-pose sequence as input and generates a 3D surface. Since SurfelMeshing requires well-overlapping depth maps, we add a preprocessing step to further align the depths.

Windowed depth averaging: the fusion module keeps a temporal window of the latest 7 marginalized keyframes. The depth maps of the 6 older keyframes are first projected into the latest keyframe. The latest keyframe then replaces its depth with a weighted average of the projected depth maps and its current depth, where the weights are inversely proportional to the time intervals. The averaged depth is used for fusion. This step effectively eliminates the non-overlap between depth maps at the cost of slight smoothing.
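
A minimal sketch of this averaging, assuming the older keyframes' depths have already been warped into the latest keyframe (the weight given to the latest keyframe itself is an assumption):

import numpy as np

def windowed_depth_average(latest_depth, projected_depths, frame_gaps):
    """`projected_depths` are depth maps of older keyframes already warped
    into the latest keyframe; `frame_gaps` are their temporal distances to it.
    Weights are inversely proportional to the time interval; the latest
    keyframe is given a nominal gap of 1. Names and the exact weighting are
    illustrative, not the paper's exact formulation."""
    depths = [latest_depth] + list(projected_depths)
    weights = np.array([1.0] + [1.0 / g for g in frame_gaps])
    weights /= weights.sum()
    stacked = np.stack(depths, axis=0)             # (K, H, W)
    return np.tensordot(weights, stacked, axes=1)  # per-pixel weighted average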

The fusion result (a textured mesh) is used for missing region visualization and potentially for region measurement.

3 Experiments

Our algorithm is currently able to reconstruct a colon in chunks when the colon structure is clearly visible. The end of a chunk is determined by recognizing a sequence of non-informative frames, e.g., frames of intervening material or bad lighting, whose tracking photoconsistencies are all lower than a threshold. The chunks we reconstructed are able to visualize the missing regions. We provide quantitative results estimating the trajectory accuracy and qualitative results on the reconstruction and missing region visualization.
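
A minimal sketch of this chunk-termination test, with placeholder run length and threshold:

def chunk_ended(photoconsistency_history, thresh, run_length=10):
    """Sketch: end the current chunk once a run of consecutive frames are all
    non-informative, i.e., their tracking photoconsistency falls below a
    threshold. `run_length` and `thresh` are placeholder values."""
    recent = photoconsistency_history[-run_length:]
    return len(recent) == run_length and all(p < thresh for p in recent)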

3.1 Trajectory Accuracy

To evaluate trajectory accuracy, we compare our method to DSO [2] and RNN-DP [12]. Since there is no groundtruth trajectory for colonoscopic video, we use COLMAP [9], a state-of-the-art SfM tool that performs exhaustive pairwise matching and global bundle adjustment, to generate high-quality camera trajectories offline. These trajectories are then used as “groundtruth” for our evaluation.

Evaluation Metrics. We use the absolute pose error (APE) to evaluate the global consistency between the trajectory estimated by the real-time system and the COLMAP-generated “groundtruth” trajectory. We define the relative pose error \(E_i\) between two poses \(P_{gt,i}, P_{est,i} \in \mathrm {SE}(3)\) at timestamp i as

$$\begin{aligned} E_i = (P_{gt,i})^{-1} P_{est,i} \in \mathrm {SE}(3) \end{aligned}$$
(1)

The APE is defined as

$$\begin{aligned} APE_i=||trans(E_i)|| \end{aligned}$$
(2)

where \(trans(E_i)\) refers to the translational components of the relative pose error. Then different statistics can be calculated on the APEs of all timestamps, e.g., the RMSE:

$$\begin{aligned} \mathrm {RMSE} = \sqrt{\frac{1}{N} \sum _{i=1}^N APE_i^2} \end{aligned}$$
(3)
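
The following sketch computes the APE-based RMSE of Eqs. (1)-(3) from 4x4 SE(3) pose matrices; trajectory association and any alignment are assumed to have been done already.

import numpy as np

def ape_rmse(poses_gt, poses_est):
    """For each timestamp, E_i = P_gt_i^{-1} P_est_i and APE_i is the norm of
    its translational part; the RMSE is taken over all timestamps."""
    apes = []
    for P_gt, P_est in zip(poses_gt, poses_est):
        E = np.linalg.inv(P_gt) @ P_est
        apes.append(np.linalg.norm(E[:3, 3]))   # translational component
    return float(np.sqrt(np.mean(np.square(apes))))
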
Fig. 4. Evaluation results on one colonoscopy sequence. (a) APE of the three approaches across the whole sequence. (b) Statistics based on the APE. (c) A bird’s-eye view of the full trajectories. (Color figure online)

Table 1. Average statistics based on the APE across 12 colonoscopic sequences
Fig. 5. Rows 1 and 2 each show the reconstruction of a colon chunk from multiple points of view; they have 12% and 10% of the surface missing, respectively. Row 3 shows the incremental fusion of the row-1 example.

Figure 4 shows evaluation results on one colonoscopic sequence. Figure 4a compares the absolute pose error (APE) of the three approaches on the example sequence: our result (red) has the lowest APE at most times. Figure 4b shows APE statistics of the three approaches: our result is better than the other two approaches. Figure 4c shows the trajectories of the three approaches together with the groundtruth. Table 1 shows the statistics of Fig. 4b but averaged across 12 colonoscopic sequences: we achieve the best result on all the metrics.

3.2 Reconstructions and Missing Regions

Figure 5 shows two high-quality examples of fused surfaces. The two chunks are dense and textured. It also shows the incremental fusion process of the first example. The snapshots are captured in real time.

There are multiple reasons for missing regions. Two important ones are the camera never being oriented toward the full circumference of parts of the colon, and haustral occlusion; these are illustrated in Figs. 6 and 1, respectively. For the four chunks shown in this paper the missing area fraction was notable: 25%, 12%, 10%, and 33%, respectively, as verified on the video by our colonoscopist co-author, Dr. McGill.

Limitations and Future Work. We currently reconstruct in chunks because the tracking will fail upon very large camera motion or deformation. Loop closure is not included in our current system; it could be useful for backward motion. Making the tracking more robust to large deformation and adding loop closure are two future directions.

Fig. 6. A part of the colon chunk is missing (33% of the surface) because the camera was never oriented toward it. This can be verified by checking the respective video frames (the upper part of the colon was not seen); however, this might not be noticed during a colonoscopy.

4 Conclusion

We developed a deep-learning-driven dense SLAM system for colonoscopy. It is the first to reconstruct chunks of a colon as fused surfaces from a video sequence (vs. existing single-frame methods) in real time. The reconstructions can be used to visualize missed colonic surfaces, which may harbor missed adenomas. Our technical contributions are (1) a recurrent neural network that predicts depth and camera poses for colonoscopic images; (2) the integration of this recurrent neural network into a standard SLAM system to improve tracking and eliminate drift; and (3) the fusion of colonoscopic frames into a high-quality global mesh. Clinically, the system should help endoscopists recognize missed colonic surface and thereby resect more pre-cancerous polyps.