
1 Introduction

Colorectal cancer is the third most common cause of cancer death in Australia and the fourth worldwide [1, 2]. Optical colonoscopy is the gold standard for detecting and removing colonic polyps, the precursors of bowel cancer [3].

Estimating the colonoscope position with high accuracy is important because it determines the location of detected polyps, especially when subsequent surgery is needed to remove cancerous polyps [4]. Despite previous work on estimating the colonoscope position (camera pose) from optical colonoscopy [5, 6], accurately localizing the colonoscope with respect to the colon surface remains a critical issue.

Conventional methods for estimating camera motion during an endoscopy procedure, such as optical flow [5,6,7] or hybrid methods [8, 9], are time consuming, sensitive to feature matching, require offline camera or sensor calibration, and suffer from drift in the camera pose estimate. Here, we develop a method based on a deep convolutional neural network (CNN) to estimate the relative camera pose between two consecutive frames; it is independent of traditional feature detection and tracking and reduces camera drift.

In recent years, convolutional neural networks have been widely used in various computer vision fields. Although they were initially designed for classification purposes [10], recent CNNs with advanced architectures have shown significant results on problems including object recognition [11], optical flow estimation [12], and dense feature matching [13], using simulated or real data. Using artificial neural networks (ANN), Bell et al. [14] estimated the camera pose of teleoperated flexible endoscopes by training ANNs on optical flow magnitude and angle while a robotic hand moved the endoscope inside a plastic colon phantom. Recently, Kendall et al. [15] regressed the camera pose from a single RGB image by training a CNN with camera poses estimated offline by a structure from motion (SfM) algorithm as ground truth. The main challenge there was the lack of ground truth for scenes that did not contain enough trackable features for SfM. Since annotating real images is difficult and expensive, synthetic data have become a popular alternative for training networks [16, 17].

In this paper, we aim to estimate the camera pose from actual optical colonoscopy video frames. To achieve this, rather than using SfM [15] to generate ground truth, we trained a CNN on simulated colonoscopy frames for which the camera poses were available from the simulator as ground truth. The camera pose for actual colonoscopy frames was then regressed by passing those frames through the trained network. The results obtained from the CNN were compared to a feature-based algorithm described in [18]. In addition, the performance of different network architectures and input data (optical flow) was investigated. A diagram of our method is shown in Fig. 1 and described in the following sections.

Fig. 1. Main processing steps of our proposed method

2 Method

We present two approaches: training the CNN on the optical flow pattern between consecutive frames, or alternatively directly on the consecutive frames themselves. The camera pose parameters inferred by the CNN were compared to a structure from motion (SfM) algorithm [18]. We first briefly describe the pre-processing, including frame preparation for training CNNs on frames, SIFT flow [19] estimation for training on the motion field, and SfM as a camera pose estimation method, before explaining the proposed CNN architectures and training details.

2.1 Pre-processing and SfM

Frames were prepared through the following steps: (i) frames were converted to grayscale; (ii) the black corners were removed, as they carry no information; and (iii) frames were resized for the modified AlexNet and GoogLeNet. The input data size (\( height \times width \times number\,of\,frames \)) was (\( 227 \times 227 \times 2 \)) for AlexNet and (\( 224 \times 224 \times 2 \)) for GoogLeNet. To estimate the optical flow pattern, the SIFT flow algorithm [19] was used to extract and match features between two consecutive grayscale frames. The resulting input data had a size of (\( 227 \times 227 \times 2 \)), comprising the motion field in the u and v directions.
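As an illustration of this pre-processing, the following Python sketch converts a pair of consecutive frames into the stacked two-channel input described above. It is not the authors' implementation; the crop width and the function name are assumptions made for illustration.

```python
# Illustrative pre-processing sketch (not the authors' code); assumes OpenCV and NumPy.
import cv2
import numpy as np

def prepare_pair(frame_a, frame_b, crop=40, size=224):
    """Convert two consecutive BGR frames into a stacked grayscale input.

    crop : border width removed to discard the black corners (assumed value).
    size : 227 for the modified AlexNet, 224 for the modified GoogLeNet.
    """
    pair = []
    for frame in (frame_a, frame_b):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # (i) grayscale
        gray = gray[crop:-crop, crop:-crop]               # (ii) remove black corners
        gray = cv2.resize(gray, (size, size))             # (iii) resize for the network
        pair.append(gray.astype(np.float32) / 255.0)
    return np.stack(pair, axis=-1)                         # (size, size, 2) input tensor
```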

2.2 Model for Estimating Camera Pose

In this section, we describe the CNN models that estimate the camera pose parameters. The input to our models is either two consecutive grayscale frames or the optical flow pattern between consecutive frames. The outputs are the relative camera translation and rotation with respect to the colon surface, with six degrees of freedom (DoF).

Learning camera translation and rotation.

Camera rotation and translation parameters were regressed by training the CNN to minimize the following objective function:

$$ Loss = \beta \cdot \left\| {R_{target} - R_{predicted} } \right\|^{2} + \left\| {T_{target} - T_{predicted} } \right\|^{2} $$

where \( \beta \) is a weight factor, equal to one for our dataset, (\( R_{target}, T_{target} \)) are the camera rotation (degrees) and translation (mm) available from the simulator as ground truth, and (\( R_{predicted}, T_{predicted} \)) are the camera pose parameters predicted by the CNN. The camera translation is normalized to a unit vector and is therefore unitless. In our experiments, the camera rotation is represented by Euler angles (α, ψ, γ). Applying the Euclidean distance to Euler angles can be ambiguous, since more than one set of values can represent the same rotation. According to [21], this is prevented under the following conditions on the Euler angles: α, γ ∈ [−π, π) and ψ ∈ [−π/2, π/2), in which case the L2 distance on rotations is a metric on SO(3) [21]. In our dataset the maximum relative rotation lies well within this range.
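A minimal NumPy sketch of this objective, assuming the six-DoF targets and predictions are held in separate rotation and translation arrays, could look as follows (illustrative only; the authors' MatConvNet implementation is not shown).

```python
# Minimal sketch of the pose regression loss; beta = 1 as used for this dataset.
import numpy as np

def pose_loss(r_target, t_target, r_pred, t_pred, beta=1.0):
    """L2 loss on Euler-angle rotation (degrees) plus L2 loss on the
    normalized (unit-length) translation, weighted by beta."""
    rot_term = np.sum((r_target - r_pred) ** 2)
    trans_term = np.sum((t_target - t_pred) ** 2)
    return beta * rot_term + trans_term
```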

Network architecture.

The base architecture of our CNN for the direct image-pair input is the state-of-the-art GoogLeNet. We also modified AlexNet for the optical flow approach to compare results. These networks were originally designed for image classification. We applied the following changes to both GoogLeNet and AlexNet to regress the camera parameters: (i) since the input data to the network are either motion features (u, v) in two dimensions or two consecutive grayscale frames, the first convolutional layer (filter size; input channels; number of filters) was modified to (11; 2; 96) for AlexNet and (7; 2; 4) for GoogLeNet, allowing the networks to operate on two-channel input; and (ii) the Softmax classifier was replaced by two fully connected layers, each with three outputs, to estimate the relative camera rotation and translation. The outputs of these last layers were then passed to a Euclidean loss function to regress the camera pose. Schematics of our networks are shown in Fig. 2.
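To illustrate these two modifications, here is a PyTorch-style sketch (the paper itself used MatConvNet); the layer dimensions follow the (filter size; input channels; number of filters) triples above, while the strides and other details are assumed from the original architectures.

```python
# Illustrative sketch of the two architectural changes; not the authors' MatConvNet code.
import torch.nn as nn

class PoseHead(nn.Module):
    """Replaces the Softmax classifier: two fully connected layers,
    one per 3-DoF output (relative rotation and relative translation)."""
    def __init__(self, in_features):
        super().__init__()
        self.fc_rot = nn.Linear(in_features, 3)
        self.fc_trans = nn.Linear(in_features, 3)

    def forward(self, x):
        return self.fc_rot(x), self.fc_trans(x)

# (i) First convolutional layer adapted to 2-channel input
#     (two grayscale frames or a (u, v) flow field); strides follow the
#     original AlexNet/GoogLeNet defaults (an assumption).
alexnet_conv1 = nn.Conv2d(in_channels=2, out_channels=96, kernel_size=11, stride=4)
googlenet_conv1 = nn.Conv2d(in_channels=2, out_channels=4, kernel_size=7, stride=2)
```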

Fig. 2. The general architecture of our CNNs for predicting camera pose parameters: the first layer is modified to accept two grayscale images or optical flow, and the classification layer is replaced by a Euclidean loss to optimize the rotation and translation predicted by the network. Note that we trained AlexNet with both optical flow and images for comparison purposes.

Training details and transfer learning.

The MatConvNet toolbox [22] was used to implement our CNN models. We trained our networks using stochastic gradient descent on our dataset, which included 30,000 grayscale frames and their optical flow patterns. The batch size and number of iterations were 580 and 2500, respectively. To prevent bias, the input data were shuffled and randomly chosen for each batch. Since pre-trained networks were used in our experiments, the learning rate was initialized to \( 10^{-4} \) and scaled by 0.1 every 1000 epochs, and the momentum was set to 0.9. We used multiple GPUs (Nvidia) to accelerate training. The networks trained on simulated data were then used to predict the camera pose from actual colonoscopy videos.
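For reference, these hyper-parameters could be expressed as follows in a PyTorch-style sketch; the actual training used MatConvNet, and only the learning rate, schedule, and momentum come from the text above.

```python
# Hypothetical optimizer/schedule setup mirroring the stated hyper-parameters.
import torch

def make_optimizer_and_schedule(model):
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-4,        # initial learning rate
                                momentum=0.9)   # momentum as stated in the text
    # Learning rate scaled by 0.1 every 1000 epochs, as described above.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)
    return optimizer, scheduler
```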

3 Dataset

3.1 Simulated Video

The simulated colonoscopy video frames were generated by the CSIRO colonoscopy simulator, described in [23]. The simulator uses a 3D analytical model of the colon with a haptic device that allows inspection of the simulated colon. The parametric mathematical model of the colon geometry embedded in the simulator allowed us to generate realistic human colonoscopy videos. The simulator uses OpenGL to render realistic colonoscopy video from the model and the camera pose, which was used as ground truth in this paper. Appearance parameters such as illumination and specular reflection were also modeled in the simulator software to generate realistic colonoscopy frames.

We generated 30,000 frames from 15 different simulated colons with different structures, and a variety of possible camera motions. Each frame’s size was 1352 × 1080 pixels, and the simulator recording rate was 30 fps.

3.2 Real Colonoscopy Video

We predicted the camera motion for actual colonoscopy video frames with our trained CNNs on five segments from five different patients, each covering around 20 cm of colon. The videos were captured by an Olympus 190HD endoscope at 50 fps (frame size 1352 × 1080 pixels). In total, 2500 in vivo frames were used for validation.

4 Experiments and Results

4.1 Simulated Video

The networks were trained on 80% of the data (chosen from different videos), which were shuffled to prevent bias in the training phase. The networks trained on the optical flow pattern and on two consecutive grayscale frames were tested on the remaining data. To demonstrate the performance of the CNNs in comparison to a feature-based algorithm, we estimated motion features between consecutive frames, removed uninformative frames [20], and computed the camera translation and rotation with respect to the colon surface [18]. The results, including the root mean square error (RMS) and standard deviation (STD) from the ground truth for both the CNNs and SfM, are shown in Fig. 3. The results indicate the higher performance of the modified GoogLeNet trained on grayscale frames in comparison to the other methods.
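As a small sketch of how the RMS error and STD against ground truth can be computed per degree of freedom (illustrative only; array names are hypothetical):

```python
# Per-DoF RMS error and STD of the predicted relative poses against ground truth.
import numpy as np

def rms_and_std(predicted, ground_truth):
    """predicted, ground_truth : arrays of shape (num_pairs, 6) holding the
    relative rotations and translations for each consecutive-frame pair."""
    error = predicted - ground_truth
    rms = np.sqrt(np.mean(error ** 2, axis=0))   # one RMS value per DoF
    std = np.std(error, axis=0)                  # spread of the error per DoF
    return rms, std
```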

Fig. 3. The root mean square error (RMS) and standard deviation (STD) between ground truth and the camera rotations and translations estimated by SfM, modified AlexNet trained by optical flow and grayscale frames, and modified GoogLeNet trained by grayscale frames.

To investigate the ability of the trained networks to generalize, the camera poses were computed with our trained networks for a simulated video of 450 frames never observed by the networks during training or validation. The resulting distance traveled by the colonoscope camera along the Z direction is shown in Fig. 4, which demonstrates the high performance of the modified GoogLeNet trained on grayscale images.
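The Z-direction trajectory can be recovered by accumulating the per-pair relative translations predicted by the network, as in the simplified sketch below. This is an assumption for illustration: it ignores rotation composition between frames, and since the predicted translations are normalized, the resulting distance is in relative units.

```python
# Simplified sketch: cumulative Z position from per-pair relative translations.
import numpy as np

def z_trajectory(relative_translations):
    """relative_translations : array of shape (num_pairs, 3) of per-pair (x, y, z)
    camera translations; returns the cumulative Z position per frame pair."""
    return np.cumsum(relative_translations[:, 2])
```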

Fig. 4. Comparison of the generalization of camera motion estimation by AlexNet (trained on optical flow and on frames) and GoogLeNet (trained on frames) on a dataset not seen before. Here, GoogLeNet shows better performance.

4.2 Validation Using a Colonoscopy Phantom

Prior to validating our trained networks on actual colonoscopy frames obtained from patients, we estimated the camera pose while the colonoscope traveled back and forth in a straight phantom. It started from a start point, represented as frame (s) in Fig. 5, and returned to the same place, frame (e). Results for the distance traveled by the camera in the Z direction for the different networks are shown in Fig. 5. The modified GoogLeNet trained on frames shows the lowest drift in comparison to SfM (D2) and AlexNet trained on optical flow (D1).

Fig. 5. Distance traveled by the camera in the Z direction, estimated by modified AlexNet, GoogLeNet, and SfM (D1 and D2 represent the drift in camera motion estimation by modified AlexNet and SfM, respectively). The GoogLeNet trained on frames shows very low drift.

4.3 Application to Actual Colonoscopy Video

Actual colonoscopy videos from different parts of colons were chosen, specifically where the camera moved back and forth (a common practice during colonoscopy), to allow validation. We estimated the distance the camera traveled in the Z direction with SfM and with our CNN methods. Figure 6 presents a qualitative evaluation of the methods in the Z coordinate. During one typical examination, the colonoscope was moved back and forth during withdrawal (video in the supplementary materials). The graph shows the Z coordinate along the center line estimated by the three methods (see legend). The image insets compare frames that are estimated to be from the same Z location. The orange frame 54 (left inset) is visually closer to frame 166 (top magenta, right insets) than to frame 141 (green image, right inset), suggesting that the CNN-based method is more accurate than the SfM approach. In addition, the CNN method estimated that frame 215 (magenta, bottom right inset) was also at the same location as frame 54 (orange, left inset), which visually matches, whereas a drift was observed for the SfM method.

Fig. 6. The camera translation in the Z direction estimated by modified AlexNet, GoogLeNet, and SfM on actual colonoscopy video frames. Frames in orange, green, and magenta are estimated to be from the same Z height, but visually the closest frame to the orange one is the first magenta frame in the right inset. The modified GoogLeNet trained on grayscale frames and the modified AlexNet trained on optical flow show better results than SfM. (Color figure online)

5 Discussion and Conclusion

In this paper, we presented a method to estimate the relative camera pose with six DoF from actual colonoscopy video frames. We used two separate new approaches: one modified and trained GoogLeNet using two consecutive grayscale frames as input; the other modified and trained AlexNet using the optical flow pattern (SIFT flow) and, for comparison, consecutive frames as input. The networks, trained on simulated data, were validated on simulated and actual colonoscopy frames. Our results, presented in Fig. 3, showed that the network trained on two consecutive frames outperformed the one trained on optical flow. In addition, the modified GoogLeNet trained on frames generalized better to frames not observed during training or validation than the network trained on optical flow, as shown in Fig. 4. Some colonoscopy frames are feature-poor, making it hard to find accurate matches between frames and rendering the conventional SfM approach inaccurate. In contrast, the CNN-based approach is more robust to these issues, resulting in higher accuracy (Fig. 3).

The computational time for estimating the relative camera pose from a trained network was 0.1 s on average, whereas SfM with bundle adjustment as the optimizer took three seconds using Matlab scripts.

In this study, we had the ground truth camera pose from the simulator software, which we used to generate thousands of frames with a variety of camera motions incorporating translation and rotation for training and validation. Previously, others used a robotic hand or magnetic sensor [14], or an SfM algorithm [15], to obtain the camera pose as ground truth for training a network. Any error in calibrating the sensor or in estimating the camera pose by SfM as ground truth could result in erroneous training data.

Our results on actual colonoscopy, presented in Fig. 6, indicate the higher performance of the CNNs in comparison to SfM for estimating the distance a colonoscope camera travels in the Z direction when returning to a previously seen location.

One of the main challenges in our work was transfer learning from the simulated to the actual frame domain. Although we obtained remarkable results using simulated data and pre-trained networks, as part of our future work we aim to implement a domain transfer method to improve our current results. We will also investigate the performance of other networks, such as the visual geometry group (VGG) network, and will propose our own network to estimate the colonoscope pose.