
1 Introduction

Surgical instrument segmentation is fundamental to Augmented Reality (AR) in image-guided robot-assisted surgery (RAS) [11] and has been an active topic of research, with convolutional neural network (CNN)-based methods surpassing prior methods by a significant margin [2, 5, 10]. CNN-based methods depend on the availability of annotated surgical data, which may be difficult to obtain [9]. Their performance has been reported on publicly available ex-vivo and porcine in-vivo RAS datasets, but not on human RAS data.

Recently, many generative approaches have been proposed to mitigate the problem of limited labelled clinical data [4, 7, 13]. For laparoscopic instrument segmentation, [13] proposed a generative adversarial network (GAN)-based method that requires only a small amount of labelled data. In [7], labelled data from cadaver surgery was translated to the in-vivo domain, and a separate segmentation model was then trained on either the translated cadaver data or on in-vivo data translated to the cadaver domain. In [4], an image-to-image (I2I) mapping of simulated to real surgical instruments was proposed, with blending into the camera background. In all of the above methods, the translated data was used to train a segmentation model. Finding validated quantitative metrics for the quality of translated data is difficult and is the topic of ongoing research [6, 15]. In many cases, the generative models change the shape of the surgical instruments and introduce artefacts while the overall accuracy decreases (Fig. 1); this is undesirable for clinical application. Hence, a segmentation strategy is needed that leverages the power of generative models to alleviate the scarcity of labelled clinical data while addressing these predominant challenges of generative models.

Therefore, in the current paper, we present a joint unpaired I2I mapping and segmentation strategy for better generalizability of a surgical instrument segmentation model to a domain with no labelled data. The generative and segmentation models are trained together and reach convergence in a synergistic manner. The generative model maps from a source domain with labelled data to a target domain with unlabelled data under constant feedback from the segmentation model, while the segmentation model trains in parallel on the generated target images and on the labelled source images. The convergence criterion of this joint system is the segmentation quality. The segmentation model also regularizes the generative model, which could otherwise change the shape of the surgical instruments during the I2I mapping. We call our method coSegGAN. The closest method to ours is presented in [7]. However, unlike in [7], our segmentation model is not pre-trained: it provides feedback to the generators as it learns from the generated data, and thus sees much more varied data. Unlike prior work, we provide an explicit structure-preserving constraint on the latent space, giving intermediate supervision during generative training. Through evaluation on real surgical sequences and publicly available datasets, we show that coSegGAN has better generalizability than existing methods. The main contribution of the paper is a joint generation and segmentation framework that provides state-of-the-art (SOTA) results for segmenting surgical instruments in unlabelled data. The method performs better than using a generative model for data augmentation as a separate step. To the best of our knowledge, this is the first method that segments surgical instruments with no labelled data by training the generative and segmentation models as a joint feedback system performing an I2I mapping between the labelled and the unlabelled domain.

Fig. 1. (Left) Table showing the limited generalizability of state-of-the-art (SOTA) methods across domains: mean Dice scores for different methods on the Endovis, UCL, and Surgery datasets. (Right) Illustration of the problem with cycleGAN, where the intermediate generated output can be unrealistic while the overall cycle consistency loss is low. Panels A, B, and C show the original, translated, and reconstructed domains.

2 Methods

2.1 Network Details

The generative part of coSegGAN uses a cycleGAN-like architecture with two generators and two discriminators [16]. Let \(x_{ai}\) and \(x_{bi}\) denote the \(i^{th}\) images in the domains \(\psi _{A}\) and \(\psi _{B}\), respectively, and let \(x_{a}\) and \(x_{b}\) denote the sets of all images in \(\psi _{A}\) and \(\psi _{B}\), respectively. \(y_{ai}\) denotes the label corresponding to the \(i^{th}\) image \(x_{ai}\), and \(y_{a}\) is the set of all such labels. \(G_{A}\) and \(G_{B}\) are the two generators estimating the mappings \(G_{A}: x_{b\rightarrow a}\) and \(G_{B}: x_{a\rightarrow b}\), respectively. The discriminator \(D_{A}\) is responsible for discriminating between true images in domain \(\psi _{A}\) and generated images \(G_{A}(x_{b})\); similarly, \(D_{B}\) discriminates between true images in domain \(\psi _{B}\) and generated images \(G_{B}(x_{a})\). Both \(G_{A}\) and \(G_{B}\) have a U-Net-like architecture [12] with a contracting and an expanding path. The contracting path consists of four blocks of a \(4 \times 4\) convolutional layer with stride 2 + Leaky ReLU + instance normalization [14], where each block halves the spatial resolution and doubles the number of channels. The expanding path consists of three blocks, each with an up-sampling layer + a \(4 \times 4\) convolution with stride 1 + ReLU activation + instance normalization. The output of each block is concatenated with the low-level features from the contracting path via skip connections and passed as input to the next block. The output of the final block is passed through a convolutional layer followed by a tanh activation. For the discriminators, we used a PatchGAN similar to [16]. For the segmentation model (S) in coSegGAN, we used the original U-Net architecture but with 16 base filters to prevent over-fitting and reduce computation; as determined empirically, this did not decrease segmentation performance compared to the original U-Net.
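For concreteness, the following is a minimal Keras sketch of the generator layout just described. It is not the authors' released code: the base filter count of 64, the \(256 \times 256\) input size, the final up-sampling step that restores the input resolution, and the use of GroupNormalization with groups = -1 as instance normalization (available in recent tf.keras versions) are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def down_block(x, filters):
    # 4x4 convolution with stride 2 -> Leaky ReLU -> instance normalization;
    # each block halves the spatial resolution and doubles the channel count.
    x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    return layers.GroupNormalization(groups=-1)(x)  # groups=-1 acts as instance norm

def up_block(x, skip, filters):
    # Up-sampling -> 4x4 convolution with stride 1 -> ReLU -> instance normalization,
    # then concatenation with the low-level features from the contracting path.
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(filters, 4, strides=1, padding="same")(x)
    x = layers.ReLU()(x)
    x = layers.GroupNormalization(groups=-1)(x)
    return layers.Concatenate()([x, skip])

def build_generator(base_filters=64, input_shape=(256, 256, 3)):
    inp = layers.Input(input_shape)
    d1 = down_block(inp, base_filters)        # 128 x 128
    d2 = down_block(d1, base_filters * 2)     # 64 x 64
    d3 = down_block(d2, base_filters * 4)     # 32 x 32
    d4 = down_block(d3, base_filters * 8)     # 16 x 16 (latent features e(x))
    u1 = up_block(d4, d3, base_filters * 4)   # 32 x 32
    u2 = up_block(u1, d2, base_filters * 2)   # 64 x 64
    u3 = up_block(u2, d1, base_filters)       # 128 x 128
    u4 = layers.UpSampling2D(2)(u3)           # restore the 256 x 256 input resolution
    out = layers.Conv2D(3, 4, padding="same", activation="tanh")(u4)
    return tf.keras.Model(inp, out, name="generator")
```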

Fig. 2. Overview of the training setup for the generation and segmentation sides. The diagram on the left shows the A-to-B mapping side of the cycleGAN, modified to incorporate the shape loss and structural loss during training; the grey blocks indicate the different losses used for updating the weights of the generator. The right side shows the inputs and loss used for training the segmentation model. Networks marked with '\(\sim \)' have frozen weights.

2.2 Training Strategy

We trained the generators, discriminators, and the segmentation model in an alternating fashion. In the first pass, gradients were back-propagated through the generators \(G_{A}\) and \(G_{B}\) while the weights of the discriminators and the segmentation model were frozen. In the next pass, the discriminators \(D_{A}\) and \(D_{B}\) and the segmentation model S were trained and updated. For training S, both \(x_{a}\) and \(G_{B}(x_{a})\) were fed as input; since the generated images are translated versions of the real images, the labels for \(G_{B}(x_{a})\) are the same as those for \(x_{a}\). Note that S sees different variations of the generated target-domain images in every epoch because the generators and S learn in parallel; as the quality of the generators' I2I mapping increases, so does the quality of the images seen by S. Details can be seen in Fig. 2.
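A minimal sketch of one such alternating update is given below. It assumes the generators \(G_{A}\), \(G_{B}\), the discriminators \(D_{A}\), \(D_{B}\), and the segmentation model S are already built; generator_objective and discriminator_objective are hypothetical helpers combining the corresponding loss terms of Sect. 2.3, and focal_loss refers to the sketch following Eq. (1). Restricting each gradient update to the relevant variables plays the role of weight freezing.

```python
import tensorflow as tf

# Optimizer settings follow Sect. 2.3 (Adam, lr = 1e-3, beta_1 = 0.9, beta_2 = 0.999).
gen_opt = tf.keras.optimizers.Adam(1e-3, beta_1=0.9, beta_2=0.999)
disc_seg_opt = tf.keras.optimizers.Adam(1e-3, beta_1=0.9, beta_2=0.999)

@tf.function
def train_step(x_a, y_a, x_b):
    # Pass 1: update G_A and G_B only; D_A, D_B and S act as frozen critics.
    with tf.GradientTape() as tape:
        gen_loss = generator_objective(x_a, y_a, x_b)   # combines the terms of Eq. (4)
    gen_vars = G_A.trainable_variables + G_B.trainable_variables
    gen_opt.apply_gradients(zip(tape.gradient(gen_loss, gen_vars), gen_vars))

    # Pass 2: update D_A, D_B and S on images translated by the (now fixed) generators.
    fake_b = G_B(x_a, training=False)
    fake_a = G_A(x_b, training=False)
    with tf.GradientTape() as tape:
        d_loss = discriminator_objective(x_a, x_b, fake_a, fake_b)
        # S trains on the real labelled images and their translations, which share y_a.
        seg_loss = focal_loss(y_a, S(x_a, training=True)) + \
                   focal_loss(y_a, S(fake_b, training=True))
        total = d_loss + seg_loss
    ds_vars = (D_A.trainable_variables + D_B.trainable_variables +
               S.trainable_variables)
    disc_seg_opt.apply_gradients(zip(tape.gradient(total, ds_vars), ds_vars))
    return gen_loss, d_loss, seg_loss
```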

2.3 Loss Functions

Segmentation Model. To prevent the much larger number of background pixels from dominating the loss, we used an \(\alpha \)-balanced variant of the focal loss, \(\mathcal {L}_{foc}\) [8], a modification of cross-entropy in which the factor \(\gamma \) controls the contribution of high-probability (easy) samples to the loss. We set the hyper-parameters \(\gamma \) and \(\alpha \) to 2.0 and 0.25, respectively. The total segmentation loss, \(\mathcal {L}_{seg}\), is

$$\begin{aligned} \mathcal {L}_{seg} = \mathcal {L}_{foc}\left( x_{a}, y_{a}\right) + \mathcal {L}_{foc}\left( G_{B}\left( x_{a}\right) , y_{a}\right) \end{aligned}$$
(1)
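A minimal sketch of this loss is shown below, assuming binary masks and sigmoid outputs from S; recent tf.keras versions also provide a built-in BinaryFocalCrossentropy that could be used instead. This is not necessarily the authors' exact implementation.

```python
import tensorflow as tf

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    # Alpha-balanced focal loss for binary masks (gamma and alpha as in the text).
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)    # prob. of the true class
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    # (1 - p_t)^gamma down-weights easy, high-probability pixels.
    return tf.reduce_mean(-alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))

def segmentation_loss(x_a, y_a, G_B, S):
    # Eq. (1): focal loss on the labelled source images and on their translated
    # versions G_B(x_a), which share the same ground-truth masks y_a.
    return focal_loss(y_a, S(x_a)) + focal_loss(y_a, S(G_B(x_a)))
```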

Generative Model. For the cycleGAN we used an adversarial loss, \(\mathcal {L}_{GAN}\), and a pixel-level cycle consistency loss, \(\mathcal {L}_{cyc}\), as proposed in [16]. Although \(\mathcal {L}_{cyc}\) reduces the number of possible mappings across domains and regularizes the cycleGAN, it does not suffice to preserve the higher-level semantics of the image; as a result, the shapes of surgical instruments can change during translation, which is not desirable. Therefore, we included feedback from the segmentation model in the total generative loss, which penalizes the generation of unrealistic surgical instrument shapes in \(G_{B}\left( x_{a}\right) \). Since we are interested in the mapping from \(x_{a}\) to \(x_{b}\), which is later fed as input to the segmentation model, we included this constraint only on the generator \(G_{B}\). This shape preservation loss, \(\mathcal {L}_{shape}\), is

$$\begin{aligned} \mathcal {L}_{shape} = \mathcal {L}_{foc}\left( G_{B}\left( x_{a}\right) , y_{a}\right) \end{aligned}$$
(2)

In cycleGAN models, \(\mathcal {L}_{cycTotal}\) is the sum of two cycle consistency losses: \(\mathcal {L}_{cycTotal} = \mathcal {L}_{cyc}(x_{a}, G_{A}(G_{B}(x_{a}))) + \mathcal {L}_{cyc}(x_{b}, G_{B}(G_{A}(x_{b})))\). These losses enforce pixel-level constraints between the original inputs \(x_{a}\) and \(x_{b}\) and the reconstructed outputs \(G_{A}(G_{B}(x_{a}))\) and \(G_{B}(G_{A}(x_{b}))\), where the two GANs are optimized together. There is no intermediate supervision after each generative step \(G_{A}: x_{b\rightarrow a}\) and \(G_{B}: x_{a\rightarrow b}\); thus \(G_{A}\) and \(G_{B}\) can produce unrealistic images while the total \(\mathcal {L}_{cyc}\) remains low (shown in Fig. 1, right). In particular, the mapping across domains should change only the 'appearance' of the scene while retaining the domain-invariant structural elements. To preserve the structural properties of the scene across domains, we introduce an explicit, intermediate, feature-level, latent space loss. This latent space loss, and the total generator loss, are:

$$\begin{aligned}&\mathcal {L}_{structure} = \mathbb {E}\left[ \left\Vert e_{A}(x_{a}) - e_{B}(G_{B}(x_{a})) \right\Vert _{1}\right] + \mathbb {E}\left[ \left\Vert e_{B}(x_{b}) - e_{A}(G_{A}(x_{b})) \right\Vert _{1}\right] \end{aligned}$$
(3)
$$\begin{aligned}&\mathcal {L}_{generator} = \lambda _{1}\mathcal {L}_{GANTotal}+\lambda _{2} \mathcal {L}_{cycTotal}+ \lambda _{3} \mathcal {L}_{shape} + \lambda _{4}\mathcal {L}_{structure} + \lambda _{5}\mathcal {L}_{I}\,\,. \end{aligned}$$
(4)

where \(e_{A}\) and \(e_{B}\) are the encoders in \(G_{B}\) and \(G_{A}\), respectively, \(\mathcal {L}_{GANTotal} = \mathcal {L}_{GAN}(G_{B}, D_{B}, x_{a}, x_{b}) + \mathcal {L}_{GAN}(G_{A}, D_{A}, x_{b}, x_{a})\), and \(\mathcal {L}_{I}\) is the identity mapping loss as given in [16]. The values of \(\lambda _{1}\), \(\lambda _{2}\), \(\lambda _{3}\), \(\lambda _{4}\), and \(\lambda _{5}\) are 1, 10, 1, 5, and 1, respectively, and were determined during hyper-parameter tuning.
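The sketch below illustrates the structure loss of Eq. (3) and the weighted sum of Eq. (4). It assumes enc_A and enc_B are sub-models exposing the contracting-path (latent) features of \(G_{B}\) and \(G_{A}\), respectively, and that the remaining loss terms are computed elsewhere; the helper names are illustrative, not the authors' code.

```python
import tensorflow as tf

# Loss weights lambda_1 ... lambda_5 as reported above.
LAMBDAS = {"gan": 1.0, "cyc": 10.0, "shape": 1.0, "structure": 5.0, "identity": 1.0}

def structure_loss(x_a, x_b, enc_A, enc_B, G_A, G_B):
    # Eq. (3): L1 distance between the latent features of an image and those of
    # its translation, for both mapping directions.
    term_ab = tf.reduce_mean(tf.abs(enc_A(x_a) - enc_B(G_B(x_a))))
    term_ba = tf.reduce_mean(tf.abs(enc_B(x_b) - enc_A(G_A(x_b))))
    return term_ab + term_ba

def total_generator_loss(gan_total, cyc_total, shape, structure, identity):
    # Eq. (4): weighted sum of the adversarial, cycle, shape, structure and
    # identity terms.
    return (LAMBDAS["gan"] * gan_total + LAMBDAS["cyc"] * cyc_total +
            LAMBDAS["shape"] * shape + LAMBDAS["structure"] * structure +
            LAMBDAS["identity"] * identity)
```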

Training Details and Hyper-Parameters. For training and testing our models, we used the TensorFlow and Keras APIs on an NVIDIA Tesla V100 GPU (16 GB). We trained the proposed models with a batch size of 8 and the Adam optimizer with \(\beta _{1} = 0.9\), \(\beta _{2} = 0.999\), and a learning rate of \(10^{-3}\). We trained for 100 epochs (approximately 12 h) and saved the weights of the segmentation model with the highest validation Dice score [17]. Code is available at: https://github.com/tajwarabraraleef/coSegGAN.
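As an illustration of the checkpointing rule (keeping the segmentation weights with the highest validation Dice), an outer training loop might look as follows. train_batches, val_data, and the file name are placeholders, train_step is the alternating update sketched in Sect. 2.2, and dice_score is defined in the evaluation sketch in Sect. 3; the original code at the linked repository may differ.

```python
import numpy as np

best_dice = 0.0
for epoch in range(100):                       # 100 epochs, batch size 8
    for x_a, y_a, x_b in train_batches:
        train_step(x_a, y_a, x_b)
    # Validation Dice of the segmentation model, thresholding its sigmoid output.
    val_dice = np.mean([dice_score(y, S.predict(x, verbose=0) > 0.5)
                        for x, y in val_data])
    if val_dice > best_dice:                   # keep the best-performing weights
        best_dice = val_dice
        S.save_weights("coSegGAN_seg_best.h5")
```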

3 Experiments

Datasets: Endovis Challenge, 2017, in-vivo Dataset [1]: This dataset is from porcine surgery procedures, with a training set consisting of 8 videos of 225 frames each and a test set consisting of 8 videos of 75 frames each and 2 videos of 300 frames each. From the training set, we used 6 videos for training and 2 videos for validation. We used 8 videos from the test set for testing; these were not used for validation. In the paper, we refer to this dataset as Endovis (abbreviated as Endo in Table 1).

UCL ex-vivo Dataset [3]: The dataset consists of 14 videos with different animal tissues as background. Similar to [3], we used 8, 2 and 4 videos for training, validation and testing, respectively.

Prostatectomy Dataset. We prepared the training dataset from 5 videos of robot-assisted radical prostatectomy procedures performed with the da Vinci Si surgical system at Vancouver General Hospital, Vancouver, Canada. We manually selected 1327 frames to isolate surgical instruments from other visible objects in the surgical field of view. These frames do not have corresponding labels. To evaluate the performance of the various methods on actual surgical data, we prepared a test set of 182 frames taken from 4 different surgeries independent of the training set; the test data represents approximately \(12\%\) of the entire surgical data used. We manually labelled surgical instruments in these frames only for the purpose of testing coSegGAN and existing methods. All frames were center-cropped to a final size of \(721 \times 503\) pixels. We refer to this dataset as Surgery in the rest of the paper. Ethics approval for data collection was obtained from the Institutional Clinical Research Ethics Board. For all three datasets, we resized the frames to \(256 \times 256\) pixels to accelerate computation.

Evaluation. We compared coSegGAN with Ternausnet, the best-performing method for binary segmentation in the Endovis Challenge [1], and RASnet, which reports a mean Dice coefficient of \(94.65\%\) on Endovis. For a fair comparison to coSegGAN, we performed data augmentation with the cycleGAN architecture given in Sect. 2. The cycleGAN model was run for 50 epochs in all cases, by which point it had converged. After cycleGAN I2I translation from the source (labelled) to the target domain, the SOTA segmentation models were trained on both the translated and the original domain data. We also performed an ablation experiment comparing coSegGAN with and without the proposed \(\mathcal {L}_{structure}\) loss. We refer to RASnet, Ternausnet, and our U-Net variant with focal loss, each trained using the augmented data generated from a separate cycleGAN (unlike our joint strategy), as \(RASnet+\), \(Ternausnet+\), and \(U\text {-}Net_{FL}+\), respectively. The coSegGAN network without \(\mathcal {L}_{structure}\) is called \(coSegGAN-\). We evaluated four combinations of datasets for the labelled and unlabelled domains. For ease of reporting, we refer to the Endovis (labelled) + Surgery (unlabelled), UCL (labelled) + Surgery (unlabelled), Endovis (labelled) + UCL (unlabelled), and UCL (labelled) + Endovis (unlabelled) combinations as case 1, case 2, case 3, and case 4, respectively. Since we want to quantify the generalizability of our method across labelled and unlabelled domains, for each dataset combination we also calculated the absolute difference in Dice scores, \(\varDelta ~Dice\), and the absolute difference in Intersection over Union (IoU), \(\varDelta ~IoU\), between the labelled domain A and the unlabelled domain B. The lower the \(\varDelta ~Dice\) and \(\varDelta ~IoU\), the higher the generalizability between domains (refer to Table 1).
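For reference, a minimal sketch of the reported metrics on binary masks, Dice, IoU, and the domain gaps \(\varDelta ~Dice\) / \(\varDelta ~IoU\) (these are the standard definitions, not the authors' evaluation code):

```python
import numpy as np

def dice_score(y_true, y_pred, eps=1e-7):
    y_true, y_pred = y_true.astype(bool), y_pred.astype(bool)
    inter = np.logical_and(y_true, y_pred).sum()
    return (2.0 * inter + eps) / (y_true.sum() + y_pred.sum() + eps)

def iou_score(y_true, y_pred, eps=1e-7):
    y_true, y_pred = y_true.astype(bool), y_pred.astype(bool)
    inter = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return (inter + eps) / (union + eps)

def domain_gap(scores_labelled_a, scores_unlabelled_b):
    # Absolute difference of the mean scores between the labelled and unlabelled
    # domains; lower means better generalizability.
    return abs(np.mean(scores_labelled_a) - np.mean(scores_unlabelled_b))
```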

Table 1. Comparison of Mean Dice and IoU scores of coSegGAN with SOTA methods

4 Results and Discussion

For case 1, the proposed coSegGAN network gave significantly higher Dice (\(92.8\%\)) and IoU (\(84.7\%\)) scores on the unlabelled domain B (Surgery) when compared to RASnet+, Ternausnet+, and \(U\text {-}Net_{FL}\)+, which have Dice scores of \(78.1\%\) (IoU = \(64.7\%\)), \(88.7\%\) (IoU = \(80.4\%\)), and \(84.1\%\) (IoU = \(42.5\%\)), respectively. For case 2 as well, the Dice score for coSegGAN on the unlabelled domain (Surgery) is \(74.3\%\) (IoU = \(59.8\%\)), while RASnet+, Ternausnet+, and \(U\text {-}Net_{FL}\)+ have lower Dice scores of \(47.8\%\) (IoU = \(33.0\%\)), \(46.0\%\) (IoU = \(31.3\%\)), and \(45.6\%\) (IoU = \(22.9\%\)), respectively. Similarly, for case 3, the Dice score for coSegGAN on the unlabelled data (UCL) is \(90\%\) (IoU = \(82.2\%\)), which is higher than \(RASnet+\), \(Ternausnet+\), and \(U\text {-}Net_{FL}+\) with Dice scores of \(83.3\%\) (IoU = \(71.9\%\)), \(41.7\%\) (IoU = \(29.0\%\)), and \(81.8\%\) (IoU = \(13.6\%\)), respectively. For case 4, the Dice score for coSegGAN on the unlabelled Endovis data is \(79.4\%\) (IoU = \(66.8\%\)), which, similar to the other cases, is higher than the rest of the methods, with Dice scores of \(RASnet+\), \(Ternausnet+\), and \(U\text {-}Net_{FL}\)+ being \(66.8\%\) (IoU = \(52.9\%\)), \(55.0\%\) (IoU = \(52.9\%\)), and \(56.5\%\) (IoU = \(42.0\%\)), respectively.

The \(\varDelta ~ Dice\) for coSegGAN for case 1 is much lower, at \(0.9\%\) (\(\varDelta ~IoU\) = \(3.7\%\)), while for \(RASnet+\), \(Ternausnet+\), and \(U\text {-}Net_{FL}\)+ it is \(10.2\%\) (\(\varDelta ~IoU\) = \(15.2\%\)), \(5.5\%\) (\(\varDelta ~IoU\) = \(9.5\%\)), and \(33.8\%\) (\(\varDelta ~IoU\) = \(43.5\%\)), respectively. For case 2, \(\varDelta ~ Dice\) for coSegGAN is \(16.8\%\) (\(\varDelta ~IoU\) = \(24.4\%\)), while for RASnet+, Ternausnet+, and \(U\text {-}Net_{FL}+\) it is \(44.5\%\) (\(\varDelta ~IoU\) = \(52.8\%\)), \(49.8\%\) (\(\varDelta ~IoU\) = \(60.8\%\)), and \(57.2\%\) (\(\varDelta ~IoU\) = \(64.5\%\)), respectively. For case 3, \(\varDelta ~ Dice\) for coSegGAN is \(3.2\%\) (\(\varDelta ~IoU\) = \(6.1\%\)), which is much lower than for \(RASnet+\), \(Ternausnet+\), and \(U\text {-}Net_{FL}\)+, with \(\varDelta ~ Dice\) of \(5.1\%\) (\(\varDelta ~IoU\) = \(8.1\%\)), \(51.6\%\) (\(\varDelta ~IoU\) = \(60.2\%\)), and \(60.7\%\) (\(\varDelta ~IoU\) = \(31.5\%\)), respectively. Similarly, for case 4, the \(\varDelta ~ Dice\) for coSegGAN is \(14.1\%\) (\(\varDelta ~IoU\) = \(24.3\%\)), compared to \(RASnet+\), \(Ternausnet+\), and \(U\text {-}Net_{FL}+\) with \(\varDelta ~ Dice\) of \(25.6\%\) (\(\varDelta ~IoU\) = \(33.0\%\)), \(38.4\%\) (\(\varDelta ~IoU\) = \(46.6\%\)), and \(18.1\%\) (\(\varDelta ~IoU\) = \(19.2\%\)), respectively. The consistently higher Dice and IoU on unlabelled data and the significantly lower \(\varDelta ~ Dice\) and \(\varDelta ~ IoU\) of coSegGAN show its generalizability compared to all other methods in all cases.

For coSegGAN, in cases 2 and 4, where the mapping is from UCL (labelled) to either Surgery or Endovis, the \(\varDelta ~ Dice\) is higher than in cases 1 and 3, showing comparatively lower generalizability. This could be because UCL is an ex-vivo dataset whose data distribution potentially differs from that of real surgery, with markedly different lighting and background. Also, only one type of surgical instrument is visible in the UCL dataset, which might have hindered the mapping to multiple types of instruments.

In the ablation experiment, coSegGAN–, i.e., coSegGAN without \(\mathcal {L}_{structure}\), showed performance comparable to coSegGAN, except for case 4, where the performance of coSegGAN is significantly higher (by approximately \(5\%\)) on the unlabelled Endovis dataset. coSegGAN– has a higher \(\varDelta ~ Dice\) for all cases except case 4, showing that with the \(\mathcal {L}_{structure}\) loss coSegGAN generalizes better to both labelled and unlabelled datasets.

A qualitative comparison of coSegGAN with the other methods for different surgeries can be seen in Fig. 3. As shown in column 1, coSegGAN preserves the overall tool structure, including finer details, better than the other methods. In comparison to \(Ternausnet+\) and \(RASnet+\), the method also produces fewer false positives (Fig. 3, column 2). Although coSegGAN performs better than the SOTA methods in identifying tools, it occasionally fails to identify a tool in the presence of blood, where the surgical instrument blends in with the background. This usually happens at the image periphery, which is relatively dark compared to the well-lit image center. Figure 3 (column 4) shows one such failure case.

Fig. 3. Qualitative comparison of our method with other methods. Overall, our method preserves the shape of the instruments better, with fewer false positives. (Column 1) Inset showing preservation of instrument shape by our method. (Column 4) Inset showing a failure case of our method.

5 Conclusion

We presented a joint generative and segmentation strategy, coSegGAN, that outperforms SOTA methods in its generalization capability to unlabelled domain data; the evaluated SOTA methods use separate I2I-mapped data augmentation and segmentation steps. The proposed losses helped to preserve finer tool structure. The method is easy to adapt to other deep-learning segmentation methods and thus can significantly improve existing methods. It utilizes unlabelled surgical data, which is much easier to acquire than labelled data, to improve any instrument segmentation model in a simple yet effective manner. Therefore, coSegGAN has the potential to significantly facilitate surgical translation of current and future surgical tool segmentation methods because it effectively alleviates the need for labelled clinical data. Current testing of coSegGAN has been limited to footage from prostatectomy procedures; a thorough performance analysis for different types of RAS surgeries is part of future work.