
1 Introduction

Deep learning achieves impressive performance in many machine learning tasks, such as classification [14, 26], semantic segmentation [24], and object detection [13]. These remarkable models rely on large, high-quality datasets such as ImageNet [7], Cityscapes [5], and Pascal VOC [10]. However, the amount and quality of labelled data for training is often a limiting factor. In medical imaging, annotating real data is a challenging and tedious task; in addition, medical data are usually subject to strict privacy rules that limit sharing. A solution to this problem is generating large quantities of synthetic data under controlled conditions. However, the lack of realism of synthetic data might limit the usefulness of the trained models when applied to real data. In this study, we propose a novel ConvNet-based model to increase the realism of synthetic data. We specifically work on simulated colonoscopy videos, but our approach can be extended to other surgical assistance simulations.

Fig. 1.

Forward cycle and identity loss of OfGAN. \(G_{next}\) transforms the input to the next frame of the target domain. It is trained with the temporal-consistent loss, which forces the reverse generation (via \(G_{self}\)) to match the ground-truth synthetic frame.

Deep convolutional networks achieve remarkable performance at extracting low-dimensional features from the image space [3]. Transforming one image into another requires the model to “understand” both the input and output domains spatially. Beyond the spatial domain, however, our video transformation task faces three more challenges: (1) how to transform the input to the target domain while preserving the original annotation; (2) how to capture the temporal information between frames to form a consistent video; and (3) how to synthesize near-infinite colonoscopy frames.

Generative Adversarial Networks (GANs) [12, 20, 23, 25, 28, 29] have made it possible to generate and transform high-quality images [28]. Generally, a GAN consists of a generator and a discriminator. The generator is trained to generate samples approximating the target distribution, while the discriminator learns to judge the realness of a given sample. Careful adversarial training makes it possible to fit or transform a complex distribution.

The domain distributions play a vital role in transformation. Hence, directly transforming colonoscopy images to another domain is challenging when the distance between the two domains is large. Recently, Shrivastava et al. [25] refined small synthetic grayscale images to be real-like with their S+U GAN. Subsequently, Mahmood et al. [20] applied the idea of S+U GAN to remove patient-specific features from real colonoscopy images. Both methods employ a grayscale target domain, which dramatically reduces the training burden.

Combining adversarial training with paired images [1, 15, 23] usually yields impressive results. However, paired datasets are rare, which compels researchers to tackle unpaired datasets. For a setting similar to ours, Zhu et al. [29] proposed the powerful Cycle-consistent GAN (CycleGAN), which trains two complementary generators to form a reconstruction loss. Oda et al. [21] transformed endoscopy CT images to the real domain using CycleGAN with a deep residual U-Net [24]. To remove surgical instruments from surgical images, DavinciGAN [18] extended CycleGAN with attention maps. However, these methods achieve only limited success and cannot ensure temporal consistency on video datasets. To reduce video flickering, Engelhardt et al. [9] combined CycleGAN with temporal discriminators for realistic surgical training, but it is difficult to ensure high temporal consistency with only image-level discrimination. For unpaired video-to-video translation, existing methods [2, 4] on general datasets use structures similar to CycleGAN together with networks that predict future frames. However, these methods do not restrict the transformed structure to its origin; instead, they encourage novel realistic features. Our OfGAN extends CycleGAN to the temporal level by forcing the generator to transform the input frame into its next real-alike frame while restricting the optical flow between two consecutive output frames to be identical to that of their input counterparts. This setup achieves remarkable performance at pixel level in both spatial and temporal domain transformation.

The contributions of this work are:

  1.

    Optical Flow GAN: Based on the standard cycle-consistent structure, we create and implement the OfGAN, which transforms the domain while keeping the temporal consistency of colonoscopy videos and barely affecting the original optical flow annotation.

  2.

    Real-enhanced Colonoscopy Generation: Our method can be incorporated into a colonoscopy simulator to generate near-infinite real-enhanced videos. The generated videos possess optical flow annotation very similar to that of the synthetic input, and frames within the transformed videos are consistent and smooth.

  3.

    Qualitative and Quantitative Evaluation: The model is evaluated both qualitatively and quantitatively on our synthetic dataset and a published CT colonoscopy dataset [23]. The transformation can be applied to annotations and thus create labels associated with the new realistic data.

2 Methodology

Let us consider that we are given a set of synthetic colonoscopy videos \(S=\{\textit{\textbf{s}}\}\) and real colonoscopy videos \(R=\{\textit{\textbf{r}}\}\), where \(\textit{\textbf{s}}=\{s_1, s_2,\dots ,s_n\}\) and \(\textit{\textbf{r}}=\{r_1, r_2,\dots ,r_m\}\); \(s_n\) denotes the n-th frame of a synthetic video and \(r_m\) the m-th frame of a real video. It should be noted that no real frame corresponds to any synthetic frame. Furthermore, we have ground-truth optical flow \(F=\{\textit{\textbf{f}}\}\) for all synthetic data, where \(\textit{\textbf{f}}=\{f_{1,2},\dots , f_{n-1,n}\}\) and \(f_{n-1,n}\) denotes the ground-truth optical flow between frames \(n-1\) and n. The goal is to learn a mapping \(G:S\rightarrow R^\prime \), where \(R^\prime \) is a set of novel videos whose optical flow is identical to that of S while the structure of S is kept unchanged. To achieve this, we follow cycle-adversarial [29] training using two generative models \(G_{self}\) and \(G_{next}\), their corresponding discriminators \(D_{syn}\) and \(D_{real}\), and an optical flow estimator Op to form an optical flow cycle-consistent structure.

2.1 Temporal-Consistent Loss

Different from the reconstruction loss in CycleGAN, our model tries to reconstruct the next frame of the given distribution. More specifically, the forward cycle is composed of two mapping functions: \(G_{next}:s_n\rightarrow r^\prime _{n+1}\) and \(G_{self}:r^\prime _{n+1}\rightarrow s^{rec}_{n+1}\). \(G_{next}\) transforms a given synthetic frame \(s_n\) to resemble a frame from the real domain while simultaneously predicting the next frame, yielding \(r^\prime _{n+1}\). In the reverse mapping, \(G_{self}\) transforms \(r^\prime _{n+1}\) to \(s^{rec}_{n+1}\). Our temporal-consistent loss then narrows the gap between \(s^{rec}_{n+1}\) and \(s_{n+1}\). The generator \(G_{next}\) performs spatial and temporal transformation simultaneously, while \(G_{self}\) only involves spatial transformation. In addition, we have a backward cycle following the reverse mapping chain: \(r_m \rightarrow G_{self} \rightarrow s^\prime _m \rightarrow G_{next} \rightarrow r^{rec}_{m+1}\). We use the \(\ell _1\) loss to mitigate blurring. The overall temporal-consistent loss is given by:

$$\begin{aligned} \mathcal {L}_{cyc}(G_{next}, G_{self}) =&\, \mathrm {E}_{s\sim Pdata(S)}[||G_{self}(G_{next}(s_n)) - s_{n+1}||_1] \nonumber \\&+ \mathrm {E}_{r\sim Pdata(R)}[||G_{next}(G_{self}(r_m)) - r_{m+1}||_1]. \end{aligned}$$
(1)
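Below is a minimal PyTorch sketch of this temporal-consistent cycle loss, assuming \(G_{next}\) and \(G_{self}\) are image-to-image generator modules and frames are batched \((B, 3, H, W)\) tensors; the function and argument names are illustrative, not the authors' code.

```python
import torch.nn.functional as F

def temporal_consistent_loss(G_next, G_self, s_n, s_next, r_m, r_next):
    """L_cyc of Eq. (1): reconstruct the *next* frame of each domain through the cycle."""
    # Forward cycle: synthetic frame n -> real-like frame n+1 -> reconstructed synthetic frame n+1
    r_fake_next = G_next(s_n)                 # r'_{n+1}
    s_rec_next = G_self(r_fake_next)          # s^{rec}_{n+1}
    loss_fwd = F.l1_loss(s_rec_next, s_next)  # compare with the ground-truth s_{n+1}

    # Backward cycle: real frame m -> synthetic-like frame m -> reconstructed real frame m+1
    s_fake = G_self(r_m)                      # s'_m
    r_rec_next = G_next(s_fake)               # r^{rec}_{m+1}
    loss_bwd = F.l1_loss(r_rec_next, r_next)  # compare with the ground-truth r_{m+1}

    return loss_fwd + loss_bwd
```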

2.2 Adversarial Loss

Adversarial loss [12] is utilized for both mapping functions described in the previous section. For the mapping function \(G_{next}:s_n\rightarrow r^{\prime }_{n+1}\), the adversarial loss is:

$$\begin{aligned} \mathcal {L}_{adv}(G_{next}, D_{real}, S, R)=&\, \mathrm {E}_{r\sim Pdata(R)}[\log D_{real}(r)] \nonumber \\&+ \mathrm {E}_{s\sim Pdata(S)}[\log (1 - D_{real}(G_{next}(s)))]. \end{aligned}$$
(2)

For the reverse pair \(G_{self}\) and \(D_{syn}\), the adversarial loss is \(\mathcal {L}_{adv}(G_{self}, D_{syn}, R, S)\) where the positions of synthetic and real data are interchanged.
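As a rough illustration, the adversarial terms could be implemented as follows; this sketch uses the binary cross-entropy (log) form of Eq. (2) and assumes PatchGAN-style discriminators that output per-patch logits.

```python
import torch
import torch.nn.functional as F

def adversarial_loss_D(D, real_batch, fake_batch):
    """Discriminator side: push real patches towards 1 and generated patches towards 0."""
    pred_real = D(real_batch)
    pred_fake = D(fake_batch.detach())  # do not backpropagate into the generator here
    loss_real = F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
    loss_fake = F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    return 0.5 * (loss_real + loss_fake)

def adversarial_loss_G(D, fake_batch):
    """Generator side: fool the discriminator, i.e. push generated patches towards 1."""
    pred_fake = D(fake_batch)
    return F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
```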

2.3 Perceptual Identity Loss

However, the temporal-consistent loss alone is insufficient to force each generator to produce its intended target. We therefore use an identity loss to force \(G_{next}\) to strictly generate the next frame and \(G_{self}\) to transform only the current frame. Furthermore, we find that measuring the distance at the perceptual level achieves better results. The loss is formulated as follows:

$$\begin{aligned} \mathcal {L}_{idt}(G_{next}, G_{self}) =&\,\mathrm {E}_{r\sim Pdata(R)}[||\theta (G_{next}(r_m)) - \theta (r_{m+1})||_1] \nonumber \\&+\mathrm {E}_{s\sim Pdata(S)}[||\theta (G_{self}(s_n)) - \theta (s_n)||_1], \end{aligned}$$
(3)

where \(\theta (\cdot )\) denotes the perceptual feature extractor.
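A possible implementation of this perceptual identity loss is sketched below, assuming \(\theta(\cdot)\) is built from the early layers of an ImageNet pre-trained VGG-13 (as in Sect. 2.5) and that the distance is measured with the \(\ell_1\) norm; the exact layer slice is an assumption about torchvision's layer ordering.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualExtractor(nn.Module):
    """theta(.): a frozen slice of VGG-13 pre-trained on ImageNet."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg13(pretrained=True).features
        self.block = nn.Sequential(*list(vgg.children())[:9])  # roughly up to the 2nd conv block (assumed slice)
        for p in self.block.parameters():
            p.requires_grad = False  # the feature extractor stays fixed

    def forward(self, x):
        return self.block(x)

def identity_loss(theta, G_next, G_self, s_n, r_m, r_next):
    """L_idt of Eq. (3): G_next should predict the next real frame; G_self should keep the synthetic frame."""
    loss_next = nn.functional.l1_loss(theta(G_next(r_m)), theta(r_next))
    loss_self = nn.functional.l1_loss(theta(G_self(s_n)), theta(s_n))
    return loss_next + loss_self
```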

2.4 Optical Flow Loss

In addition to the above unsupervised objectives, the optical flow loss uses supervised information to preserve the annotation and stabilize training. We restrict every two consecutive real-alike frames to have the same optical flow as their corresponding synthetic frames, as shown in Fig. 1. The optical flow loss is:

$$\begin{aligned} \mathcal {L}_{op}(G_{next}) = \mathrm {E}_{s\sim Pdata(S),f\sim Pdata(F)}[||Op(r^\prime _n, r^\prime _{n+1}) - f_{n,n+1}||_1], \end{aligned}$$
(4)

where \(Op(\cdot )\) represents a non-parametric model for optical flow estimation and \(r^\prime _n = G_{next}(s_n)\).
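A minimal sketch of this loss follows, with `flow_estimator` standing in for the frozen pre-trained PWC-Net used by the authors; its weights are fixed, but it must remain differentiable with respect to its inputs so that gradients reach \(G_{next}\).

```python
import torch.nn.functional as F

def optical_flow_loss(G_next, flow_estimator, s_n, s_next, f_gt):
    """L_op of Eq. (4): two consecutive transformed frames must share the ground-truth flow f_{n,n+1}."""
    r_fake_n = G_next(s_n)                             # r'_n
    r_fake_next = G_next(s_next)                       # r'_{n+1}
    flow_pred = flow_estimator(r_fake_n, r_fake_next)  # Op(r'_n, r'_{n+1}), shape (B, 2, H, W)
    return F.l1_loss(flow_pred, f_gt)
```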

Therefore, the overall loss function can be presented as:

$$\begin{aligned} \mathcal {L}(G_{next}, G_{self}, D_{syn}, D_{real})\, =\,&\mathcal {L}_{adv}(G_{next}, D_{real}, S, R)\, +\, \mathcal {L}_{adv}(G_{self}, D_{syn}, R, S) \nonumber \\&+ \lambda \mathcal {L}_{cyc}(G_{next}, G_{self}) + \beta \mathcal {L}_{idt}(G_{next}, G_{self}) + \sigma \mathcal {L}_{op}(G_{next}), \end{aligned}$$
(5)

where \(\lambda \), \(\beta \) and \(\sigma \) weight the importance of each term. The goal is to solve the min-max problem of

$$G_{next}^*, G_{self}^* = \arg \min _{G_{next}, G_{self}}\max _{D_{syn}, D_{real}} \mathcal {L}(G_{next}, G_{self}, D_{syn}, D_{real}).$$
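For concreteness, one generator update of this min-max problem might look like the following sketch, which reuses the loss helpers sketched in the previous subsections and the weights from Sect. 2.5; the discriminator update (the max step) would alternate with it.

```python
LAMBDA, BETA, SIGMA = 150.0, 75.0, 0.1  # lambda, beta, sigma from Sect. 2.5

def generator_step(G_next, G_self, D_real, D_syn, flow_estimator, theta, batch, optimizer_G):
    # One synthetic frame pair with its ground-truth flow, plus one real frame pair.
    s_n, s_next, f_gt, r_m, r_next = batch

    loss_adv = adversarial_loss_G(D_real, G_next(s_n)) + adversarial_loss_G(D_syn, G_self(r_m))
    loss_cyc = temporal_consistent_loss(G_next, G_self, s_n, s_next, r_m, r_next)
    loss_idt = identity_loss(theta, G_next, G_self, s_n, r_m, r_next)
    loss_op = optical_flow_loss(G_next, flow_estimator, s_n, s_next, f_gt)

    loss = loss_adv + LAMBDA * loss_cyc + BETA * loss_idt + SIGMA * loss_op
    optimizer_G.zero_grad()
    loss.backward()
    optimizer_G.step()
    return loss.item()
```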

2.5 Implementation Details

To be fair to competing methods, we adopt many training parameters from CycleGAN. We use an encoder-decoder structure for the generators and PatchGAN [19] discriminators. Both generators consist of two down-sampling and two up-sampling layers with six residual blocks in between. For extracting perceptual features, we use the output of the second convolution block of a VGG-13 [26] pre-trained on ImageNet. Similarly, the optical flow is estimated with a pre-trained PWC-Net [27]. To optimize the network, we employ the Adam [17] optimizer with betas of (0.5, 0.999) and a learning rate of \(2\times 10^{-4}\). The input frames are resized to \(256 \times 256\), and the corresponding optical flow is re-scaled accordingly. We set \(\lambda = 150\), \(\beta = 75\), and \(\sigma = 0.1\). The framework is implemented in PyTorch [22] and trained on four Nvidia P100 GPUs for 100 epochs.
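The generator structure and optimizer configuration described above could be sketched as follows; channel widths, normalization layers, and class names are assumptions in the spirit of the CycleGAN reference implementation, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)  # residual connection

class Generator(nn.Module):
    """Encoder-decoder: 2 down-sampling layers, 6 residual blocks, 2 up-sampling layers."""
    def __init__(self, ch=64, n_blocks=6):
        super().__init__()
        layers = [nn.ReflectionPad2d(3), nn.Conv2d(3, ch, 7), nn.InstanceNorm2d(ch), nn.ReLU(True)]
        # two stride-2 down-sampling layers
        layers += [nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.InstanceNorm2d(2 * ch), nn.ReLU(True),
                   nn.Conv2d(2 * ch, 4 * ch, 3, stride=2, padding=1), nn.InstanceNorm2d(4 * ch), nn.ReLU(True)]
        # six residual blocks at the bottleneck resolution
        layers += [ResBlock(4 * ch) for _ in range(n_blocks)]
        # two up-sampling layers back to the input resolution
        layers += [nn.ConvTranspose2d(4 * ch, 2 * ch, 3, stride=2, padding=1, output_padding=1),
                   nn.InstanceNorm2d(2 * ch), nn.ReLU(True),
                   nn.ConvTranspose2d(2 * ch, ch, 3, stride=2, padding=1, output_padding=1),
                   nn.InstanceNorm2d(ch), nn.ReLU(True),
                   nn.ReflectionPad2d(3), nn.Conv2d(ch, 3, 7), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Optimizer configuration from Sect. 2.5: Adam, betas (0.5, 0.999), learning rate 2e-4.
G_next, G_self = Generator(), Generator()
optimizer_G = torch.optim.Adam(
    list(G_next.parameters()) + list(G_self.parameters()), lr=2e-4, betas=(0.5, 0.999))
```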

3 Experiments

The synthetic data we use are generated by a colonoscopy simulator [6]. For training, we extracted 8000 synthetic colonoscopy frames with ground-truth optical flow from five videos and 2741 real frames from 12 videos. For testing, 2000 unseen synthetic frames are captured from two long videos. The real data are captured from patients by our specialists. We perform fish-eye correction on all real data and discard real frames with extreme lighting conditions, wall-only views, and blur, which leaves 1472 real images for training. We also test our model qualitatively on a published CT colonoscopy dataset [23].

We present qualitative and quantitative evaluations of our test results. The qualitative evaluation focuses on single-frame quality and temporal consistency in a subjective manner. For the quantitative analysis, we use an auxiliary metric, the Domain Temporal-Spatial (DTS) score, to measure temporal and spatial quality simultaneously.

Fig. 2.

Qualitative evaluation of four successive frames from each model. From top to bottom and left to right, the rows show: input frames, results from the baseline, standard CycleGAN plus our optical flow loss, temporal-consistent loss only, and the complete OfGAN with \(\sigma =0.1\) and \(\sigma =5\). Red rectangles highlight features not present in the preceding frame. Differences are best viewed on screen using zoom. (Color figure online)

3.1 Qualitative Evaluation

The single-frame quality assessment is two-fold: on the one hand, whether the transformed frame looks more like the real ones; on the other hand, whether it contains less noise. For temporal consistency, we select four consecutive frames and concentrate on inconsistencies among them. We regard the well-known CycleGAN as our baseline model and compare four model variants for the ablation study. Results show that merely adding the optical flow loss to the baseline does not help and instead worsens both spatial and temporal quality: the standard cycle structure does not involve any temporal information, so spatial and temporal information cannot be learned at the same time. As a result, the black corner becomes more obvious and more inconsistent white spots emerge. Applying only the temporal-consistent loss (Fig. 2, row 1, columns 5–8) interferes with the convergence of the original training, which produces large mask-like noise. The combination of the optical flow loss and the temporal-consistent loss gives much more realistic and consistent results (Fig. 2, row 2, columns 5–8): almost no white spots appear in any frame, and the colon wall looks more bloody, although a pale mask-like artifact arises on the right. In terms of single-frame quality (Fig. 3b), our method achieves better realness than the baseline. By comparison, our method successfully removes the black corners and fills in the detail in the deeper lumen, and white spots are rare in our results. The surface produced by the baseline is so glossy that it looks far from a human organ; in contrast, our method produces a more vivid lighting effect.

Fig. 3.

(a) Qualitative evaluation of CT colonoscopy outputs: CT input (top) and our transformed results (bottom). (b) Qualitative comparison of two selected frame pairs from the baseline (top) and our method (bottom); zoomed regions show the detail inside the nearby red rectangles. (c) Qualitative evaluation of five consecutive CT frame pairs: CT input (top) and our transformed results (bottom). (The images are best viewed on screen using zoom functionality.) (Color figure online)

The choice of the parameter \(\sigma \) is a trade-off between consistency and realness. As its value goes from 0.1 to 5, the results vary from the most realistic to the most consistent. Hence, \(\sigma \) can be adjusted depending on the specific application scenario.

We also test our method on CT colonoscopy videos (Fig. 3a, top row), whose surfaces are coarse and texture-less compared with our synthetic data. In this case, we have no ground-truth optical flow for the input; instead, we use the estimated optical flow as the ground-truth for training. Our method successfully colors the surface realistically; it also removes the coarse surface and adds more realistic light reflections inside the dark lumen. The lack of blood vessels is due to our real data not being rich in blood vessels. Sequential frames (Fig. 3c) show that the newly introduced light reflection is consistent throughout the frames, and no apparent noise or inconsistent features appear.

3.2 Quantitative Evaluation

The quantitative evaluation should combine temporal, spatial, and domain-distance measurements to overcome the trade-off problem. Hence, we use the DTS, a weighted sum of four normalized metrics: the Average End-Point Error (AEPE) [8] with respect to the ground-truth optical flow (\(\mathrm {E}_{gt}\)), the AEPE with respect to the estimated optical flow of the input (\(\mathrm {E}_{pred}\)), the average perceptual loss (\(\mathcal {L}_{perc}\)) [16], and the average style loss (\(\mathcal {L}_{style}\)) [11]. AEPE measures how well two consecutive outputs possess the same optical flow as the corresponding inputs, which also indicates the temporal consistency of the output. \(\mathcal {L}_{perc}\) and \(\mathcal {L}_{style}\) measure spatial quality and domain distance. The coefficients are set empirically based on the importance of each term. To compute the means, we randomly select ten samples from the entire real dataset for each test datum. Finally, the metrics are normalized over 36 test cases with different hyper-parameters. The smaller the DTS, the better the performance. The overall formula of DTS is:

$$\begin{aligned} DTS = \frac{3}{8}\mathcal {N}(\mathrm {E}_{gt}) + \frac{1}{8}\mathcal {N}(\mathrm {E}_{pred}) + \frac{1}{4}\mathcal {N}(\mathcal {L}_{perc}) + \frac{1}{4}\mathcal {N}(\mathcal {L}_{style}) + 0.5, \end{aligned}$$
(6)

where \(\mathcal {N}(\cdot )\) denotes normalization; 0.5 is added to make every value positive.
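A small sketch of how DTS could be computed over a set of test cases is given below; the paper only states that the four metrics are normalized, so the zero-mean, unit-variance normalization used here is an assumption.

```python
import numpy as np

def dts(e_gt, e_pred, l_perc, l_style):
    """DTS of Eq. (6). Each argument is an array of per-case values; lower DTS is better."""
    def normalize(x):  # assumed: zero-mean, unit-variance normalization over the test cases
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-8)
    return (3 / 8 * normalize(e_gt) + 1 / 8 * normalize(e_pred)
            + 1 / 4 * normalize(l_perc) + 1 / 4 * normalize(l_style) + 0.5)
```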

Table 1. Quantitative evaluation of the test cases corresponding to the models in Fig. 2.

The baseline method sacrifices realness to achieve good consistency, while using only the temporal-consistent loss does the opposite; both cases obtain a worse DTS (Table 1). Our method combines both advantages and, even though it is not the best on every metric, beats the baseline on \(\mathrm {E}_{gt}\), \(\mathcal {L}_{perc}\), \(\mathcal {L}_{style}\), and DTS. Notice that \(\mathrm {E}_{gt}\) relies on the accuracy of the optical flow estimator, PWC-Net, which achieved a state-of-the-art AEPE of 2.31 on MPI Sintel [27]. Even though we use a different dataset, our \(\mathrm {E}_{gt}=1.22\) (Table 1, last row) indicates that the optical flow is sufficiently close to the ground-truth.

4 Conclusion

Our proposed OfGAN transforms labeled synthetic colonoscopy videos into real-alike ones. We have shown the performance of OfGAN on our synthetic dataset and a published CT dataset. The transformed dataset has outstanding temporal and spatial quality and can be used for data augmentation, domain adaptation, and other machine learning tasks to enhance performance. As for limitations, the performance of the proposed method may degrade if it fails to transform a frame correctly within a sequence; this can have a dramatic effect when generating long videos, which we will address in future work.