1 Introduction

Segmenting surgical instruments and tissue poses a significant challenge in robotic surgery, as it plays a vital role in instrument tracking and position estimation within surgical scenes. Nonetheless, current deep learning models often have limited generalization capacity because they are tailored to specific surgical sites. Consequently, developing generalist models that can effectively adapt to various surgical scenes and segmentation objectives is crucial to advancing the field of robotic surgery [18]. Recently, segmentation foundation models have made great progress in natural image segmentation. The segment anything model (SAM) [14], trained on more than one billion masks, exhibits remarkable proficiency in generating precise object masks from various prompts such as bounding boxes and points, and stands as the pioneering and most renowned foundation model for segmentation. However, several works have revealed that SAM can fail on common medical image segmentation tasks [4, 6, 8, 16]. This is not surprising, since SAM's training dataset primarily comprises natural images. It therefore raises the question of how to harness SAM's strong feature extraction capability for medical imaging tasks. Med SAM Adapter [22] utilizes medical-specific domain knowledge to improve the segmentation model through a simple yet effective adaptation technique. SAMed [23] applies a low-rank-based finetuning strategy to the SAM image encoder, together with the prompt encoder and mask decoder, on a medical image segmentation dataset.

However, evaluating the performance of SAM in the context of surgical scenes remains an insufficiently explored area with potential for further investigation. This study uses two publicly available robotic surgery datasets to assess SAM's generalizability under different settings, including bounding-box and point prompts. Moreover, we investigate fine-tuning SAM through Low-rank Adaptation (LoRA) to examine its capability to predict masks for different classes without prompts. Additionally, we analyze SAM's robustness by assessing its performance on synthetic surgery datasets containing various levels of corruption and perturbation.

2 Experimental Settings

Datasets. We have employed two classical datasets in endoscopic surgical instrument segmentation, i.e., EndoVis17 [2] and EndoVis18 [1]. For the EndoVis17 dataset, unlike previous works [5, 13, 20] which conduct 4-fold cross-validation for training and testing on the 8 \(\times \) 225-frame released training data, we report SAM’s performance directly on all eight sequences (1–8). For the EndoVis18 dataset, we follow the dataset split in ISINet [5], where sequences 2, 5, 9, and 15 are utilized for evaluation.

Prompts. The original EndoVis datasets [1, 2] do not provide bounding-box or point annotations. We have labeled the datasets with bounding boxes for each instrument, associated with the corresponding class information. For the single-point prompt, we obtain the center of each instrument mask by computing the moments of its contour, as sketched below. Since SAM [14] only predicts binary segmentation masks, for instrument-wise segmentation the output instrument labels are inherited from the input prompts.
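
As a concrete illustration, the following sketch derives such a point prompt with OpenCV; the helper name and the assumption that the largest external contour represents the instrument are ours.

```python
import cv2
import numpy as np

def mask_center_point(mask: np.ndarray):
    """Derive a single foreground point prompt from a binary instrument mask
    via the moments of its (largest) contour."""
    mask = (mask > 0).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    cnt = max(contours, key=cv2.contourArea)      # assume the largest contour is the instrument
    m = cv2.moments(cnt)
    if m["m00"] == 0:
        return None
    cx, cy = int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
    # SAM expects point_coords of shape (N, 2) and point_labels with 1 = foreground.
    return np.array([[cx, cy]]), np.array([1])
```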

Metrics. The IoU and Dice metrics from the EndoVis17 [2] challenge are used. Specifically, only the classes present in a frame are considered in the calculation for instrument segmentation.
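
For reference, a minimal sketch of the per-frame IoU under this protocol is given below, assuming a class counts as present when it appears in the ground truth of that frame; the official challenge toolkit may differ in details.

```python
import numpy as np

def challenge_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Per-frame IoU where only classes present in the ground truth contribute."""
    ious = []
    for c in range(1, num_classes + 1):          # class 0 is background
        gt_c = gt == c
        if not gt_c.any():                       # skip classes absent from this frame
            continue
        pred_c = pred == c
        union = np.logical_or(pred_c, gt_c).sum()
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious)) if ious else float("nan")
```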

Comparison Methods. We include several classical and recent methods, namely the vanilla UNet [17], TernausNet [20], MF-TAPNet [13], Islam et al. [10], Wang et al. [21], ST-MTL [11], S-MTL [19], AP-MTL [12], ISINet [5], TraSeTR [24], and S3Net [3], for surgical binary and instrument-wise segmentation. The ViT-H-based SAM [14] is employed in all our investigations except for the finetuning experiments. Note that an absolutely fair comparison is not possible, because the existing methods do not require prompts during inference.

Table 1. Quantitative comparison of binary and instrument segmentation on EndoVis17 and EndoVis18 datasets. The best and runner-up results are shown in bold and underlined.

3 Surgical Instrument Segmentation with Prompts

Implementation. With bounding boxes and single points as prompts, we feed the images to SAM [14] to obtain the predicted binary masks for the target objects. Because SAM [14] cannot provide consistent categorical information, we instead use the class information from the bounding boxes directly. In this way, we derive instrument-wise segmentation while bypassing possible misclassification errors, an essential factor affecting instrument-wise segmentation accuracy.
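
A minimal sketch of this prompted inference with the official segment_anything package is shown below; the checkpoint path, frame path, box coordinates, and class id are placeholders.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")   # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)       # placeholder frame
predictor.set_image(image)

box = np.array([120, 80, 480, 360])              # [x0, y0, x1, y1] for one instrument
masks, scores, _ = predictor.predict(box=box, multimask_output=False)

# SAM returns only a binary mask; the instrument class is inherited from the
# class information attached to the bounding-box prompt.
instrument_map = np.zeros(image.shape[:2], dtype=np.uint8)
instrument_map[masks[0]] = 3                     # e.g., class id 3 from the box annotation
```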

Results and Analysis. As shown in Table 1, with bounding boxes as prompts, SAM [14] outperforms previous unprompted supervised methods in binary and instrument-wise segmentation on both datasets. However, with single points as prompts, SAM's [14] performance degrades substantially, indicating its limited ability to segment surgical instruments from weak prompts and revealing that its performance relies heavily on prompt quality. For complicated surgical scenes, SAM [14] still struggles to produce accurate segmentation results, as shown in columns (a) to (l) of Fig. 1. Typical challenges, including shadows (a), motion blur (d), occlusion (b, g, h), light reflection (c), insufficient light (j, l), over-brightness (e), ambiguous suturing thread (f), instrument wrist (i), and irregular instrument pose (k), all lead to unsatisfactory segmentation results.

Fig. 1. Qualitative results of SAM on various challenging frames. Red rectangles highlight the typical challenging regions which cause unsatisfactory predictions. (Color figure online)

Table 2. Quantitative results on various corrupted EndoVis18 validation data.

4 Robustness Under Data Corruption

Implementation. Referring to the robustness evaluation benchmark [7], we evaluate SAM [14] under 18 types of data corruption at 5 severity levels with box prompts, following the official implementations. Note that Elastic Transformation is omitted to avoid inconsistency between the input image and the associated masks. The adopted corruptions fall into four distinct categories: Noise, Blur, Weather, and Digital.
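
One way to generate the corrupted frames is via the imagecorruptions package that implements this benchmark; the sketch below, with the corruption list written out by us, illustrates the procedure.

```python
import numpy as np
from imagecorruptions import corrupt             # implements the benchmark of [7]

# 18 corruption types (Elastic Transform omitted); names follow the package.
CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "speckle_noise",
    "defocus_blur", "glass_blur", "motion_blur", "zoom_blur", "gaussian_blur",
    "snow", "frost", "fog", "brightness", "contrast",
    "pixelate", "jpeg_compression", "spatter", "saturate",
]

def corrupted_versions(image: np.ndarray):
    """Yield (name, severity, corrupted frame) for all 18 x 5 corruption settings."""
    for name in CORRUPTIONS:
        for severity in range(1, 6):             # severity levels 1-5
            yield name, severity, corrupt(image, corruption_name=name, severity=severity)
```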

Fig. 2. Qualitative results of SAM under 18 data corruptions of level-5 severity.

Results and Analysis. As depicted in Table 2, SAM's [14] performance degrades progressively as the severity of data corruption increases. The extent of the degradation depends on the nature of the corruption, yet in most scenarios SAM's performance diminishes significantly. Notably, JPEG Compression and Gaussian Noise have the greatest impact on segmentation performance, whereas Brightness has a negligible effect. Figure 2 presents one exemplar frame in its original state alongside its corrupted versions at severity level 5. We can observe that SAM [14] suffers significant performance degradation in most cases.

5 Automatic Surgical Scene Segmentation

Implementation. Without prompts, SAM [14] can also perform automatic mask generation (AMG) for the entire image. For a naive investigation of automatic surgical scene segmentation, we use the default parameters from the official implementation without further tuning. The color of each segmented mask is randomly assigned because SAM [14] only generates a binary mask for each object.
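
A minimal sketch of this default AMG pipeline with the official segment_anything package follows; the checkpoint and image paths are placeholders.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")   # placeholder path
mask_generator = SamAutomaticMaskGenerator(sam)  # default AMG parameters, no tuning

image = cv2.cvtColor(cv2.imread("scene.png"), cv2.COLOR_BGR2RGB)       # placeholder frame
masks = mask_generator.generate(image)

# Each entry holds a binary 'segmentation' mask plus metadata ('area', 'bbox',
# 'predicted_iou', ...); no semantic label is attached, hence the random colors.
masks = sorted(masks, key=lambda m: m["area"], reverse=True)
```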

Results and Analysis. As shown in Fig. 3, in surgical scene segmentation on the EndoVis18 [1] data, SAM [14] produces promising results on simple scenes such as columns (a) and (f). However, it encounters difficulties in more complicated scenes, struggling to recognize the articulated parts of an instrument as a single entity and to group discrete tissue structures into interconnected units. As a foundation model, SAM [14] still lacks comprehensive awareness of objects' semantics, especially in downstream domains like surgical scenes.

Fig. 3. Unprompted automatic mask generation for surgical scene segmentation.

6 Parameter-Efficient Finetuning with Low-Rank Adaptation

With the rapid emergence of foundational and large AI models, utilizing pretrained models effectively and efficiently for downstream tasks has attracted increasing research interest. Although SAM [14] shows decent segmentation performance with prompts and can cluster objects in surgical scenes, we seek to finetune and adapt it to the traditional unprompted multi-class segmentation pipeline: taking a single image as input and predicting its segmentation mask with categorical labels.

Fig. 4. Overall architecture of our SurgicalSAM.

Implementation. To efficiently finetune SAM [14] and enable it to support multi-class segmentation without relying on prompts, we utilize the strategy of Low-rank Adaptation (LoRA) [9] and adapt the original mask decoder to output categorical labels. Taking inspiration from SAMed [23], we implement a modified architecture as shown in Fig. 4, whereby the pretrained SAM image encoder keeps its weights \(W_{enc}\) frozen during finetuning while additional light-weight LoRA layers are incorporated and updated. In this way, we not only leverage the exceptional feature extraction ability of the original SAM encoder, but also gradually capture surgical data representations and store the domain-specific knowledge in the LoRA layers in a parameter-efficient manner. We denote this modified architecture as “SurgicalSAM”. With an input image x, we derive the image embedding \(h_{image}\) following

$$\begin{aligned} h_{image} = W_{enc}x + \varDelta W x, \end{aligned}$$
(1)

where \(\varDelta W\) is the weight update matrix of the LoRA layers. We decompose \(\varDelta W\) into two smaller matrices, \(\varDelta W = W_A W_B\), where \(W_A\) and \(W_B\) are \(A \times r\) and \(r \times B\) dimensional matrices, respectively, and r is a hyper-parameter specifying the rank of the low-rank adaptation. To balance model complexity, adaptability, and the risk of underfitting or overfitting, we empirically set the rank r of \(W_A\) and \(W_B\) in the LoRA layers to 4.
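
A minimal PyTorch sketch of such a LoRA layer is given below; in practice the low-rank factors are injected into the attention projections of the frozen ViT encoder as in SAMed [23], and the class name and initialization details here are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W_enc x + (x W_A) W_B as in Eq. (1), with the pretrained weights frozen."""
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():         # freeze W_enc
            p.requires_grad = False
        self.W_A = nn.Parameter(torch.empty(base.in_features, r).normal_(std=0.02))
        self.W_B = nn.Parameter(torch.zeros(r, base.out_features))  # Delta W = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.W_A) @ self.W_B
```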

Table 3. Quantitative evaluation of SurgicalSAM under data corruption.
Fig. 5. Qualitative comparison of our SurgicalSAM with the original SAM.

During unprompted automatic mask generation (AMG), the original SAM uses fixed default embeddings \(h_{default}\) for the prompt encoder with weights \(W_{prompt}\). We adopt this strategy and update the lightweight prompt encoder during finetuning, as shown in Fig. 4. In addition, we modify the segmentation head of the mask decoder \(W_{dec}\) to produce a prediction for each semantic class. In contrast to the ambiguity-aware binary predictions of SAM's original mask decoder, the modified decoder predicts each semantic class of \(\hat{y}\) in a deterministic manner; in other words, it is capable of semantic segmentation beyond binary segmentation (Fig. 5).

We adopt the training split of the EndoVis18 dataset for finetuning and test on the validation split, consistent with the other works reported in Table 1. Following SAMed [23], we adopt a combination of the Cross Entropy loss \(L_{CE}\) and Dice loss \(L_{Dice}\), which can be expressed as

$$\begin{aligned} L = \lambda L_{Dice} + (1-\lambda ) L_{CE}, \end{aligned}$$
(2)

where \(\lambda \) is a weighting coefficient balancing the effects of the two losses; we empirically set \(\lambda \) to 0.8 in our experiments. Due to resource constraints, we utilize the ViT_b version of SAM and finetune on two RTX3090 GPUs. Training runs for a maximum of 160 epochs with a batch size of 12 and an initial learning rate of 0.001. To stabilize the finetuning process, we apply warmup for the first 250 iterations, followed by exponential learning rate decay. Random flip, rotation, and crop are applied to augment the training images and avoid overfitting. The images are resized to \(512 \times 512\) as model inputs. Besides, we use the AdamW [15] optimizer with a weight decay of 0.1 to update the model parameters.
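
For reference, a minimal sketch of the combined loss in Eq. (2) is given below; the Dice formulation is our own and SAMed's implementation may differ in details.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor,
                 num_classes: int, lam: float = 0.8, eps: float = 1e-5) -> torch.Tensor:
    """L = lam * L_Dice + (1 - lam) * L_CE for logits (B, C, H, W) and labels (B, H, W)."""
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()
    return lam * dice + (1.0 - lam) * ce
```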

Results and Analysis. After this naive finetuning, SurgicalSAM can manage instrument-wise segmentation without relying on prompts. With further tuning of hyper-parameters such as the learning rate, batch size, and optimizer, SurgicalSAM achieves a \({\textbf {71.38}}\%\) mIoU score on the validation split of the EndoVis18 dataset, which is on par with the state-of-the-art models in Table 1. Since the other methods in Table 1 utilize temporal and optical flow information as a supplement [5] or conduct multi-task optimization [3, 24], the results of our image-only, single-task SurgicalSAM are promising. Besides, the encoder backbone we finetuned is the smallest ViT_b due to limited computational resources; we believe the largest ViT_h backbone can yield much better performance. Compared with the original SAM, the new architecture is of great practical significance as it achieves semantic-level automatic segmentation. Moreover, the additionally trained parameters amount to only 18.28 MB, demonstrating the efficiency of our finetuning strategy.

Furthermore, we evaluate the robustness of SurgicalSAM under data corruption using the EndoVis18 validation dataset. As shown in Table 3, the model's performance degrades significantly under various forms of data corruption, particularly Blur.

7 Conclusion

In this study, we explore the robustness and zero-shot generalizability of SAM [14] in the field of robotic surgery on two robotic instrument segmentation datasets from the MICCAI EndoVis 2017 and 2018 challenges. Extensive empirical results suggest that SAM [14] is deficient in segmenting entire instruments with point-based prompts and in unprompted settings, as clearly shown in Fig. 1 and Fig. 3. This implies that SAM [14] cannot capture surgical scenes precisely despite its surprising zero-shot generalization ability. Besides, it struggles to accurately predict certain parts of the instrument mask when instruments overlap or when given only a point-based prompt. It also fails to identify instruments in complex surgical scenarios involving blood, reflection, blur, and shade. Moreover, we extensively evaluate the robustness of SAM [14] under a wide range of data corruptions. As indicated by Table 2 and Fig. 2, SAM [14] encounters significant performance degradation in many scenarios. To shed light on adapting SAM for surgical tasks, we fine-tune SAM using LoRA. Our fine-tuned SAM, i.e., SurgicalSAM, demonstrates class-wise mask prediction without any prompt.

As a foundational segmentation model, SAM [14] shows remarkable generalization capability in robotic surgical segmentation, yet it still suffers performance degradation due to downstream domain shift, data corruptions, perturbations, and complex scenes. To further improve its generalization capability and robustness, a broad spectrum of evaluations and extensions remains to be explored and developed.