Introduction

In minimally invasive surgery (MIS), endoscopy serves as a pivotal visual aid, facilitating precise lesion observation, diagnosis, and treatment. Nevertheless, the unique imaging environment presents a challenge: harsh specular reflections inevitably occur during the procedure [1]. These reflections not only cause visual disturbances but also impede the performance of computer vision algorithms [2]. Effective removal of specular reflections from endoscopic images is therefore essential.

Fig. 1: Different types of endoscopic specular reflections: (a) different reflection intensities; (b) different reflection halos; (c) different reflection shapes and sizes.

Fig. 2: EndoSRR: overview of the proposed endoscopic specular reflection removal framework.

Specular reflection removal typically involves two stages: specular reflection detection and reflective region inpainting. In the detection stage, traditional methods rely primarily on conventional image processing and fall into two categories: threshold-based methods and principal component analysis (PCA)-based methods. Threshold-based approaches typically convert the image to the HSV/YUV color space and then apply double or adaptive thresholding to isolate the reflection region [3,4,5,6]. For instance, Arnold et al. [6] combined nonlinear filtering with color thresholding for detection. More recently, Li et al. [7] introduced adaptive robust principal component analysis (AdaRPCA), and Pan et al. [1] proposed accelerated adaptive non-convex robust principal component analysis (AANC-RPCA) for specular reflection detection. These PCA-based methods typically perform sparse and low-rank decomposition of the endoscopic images to derive reflection masks. However, such conventional algorithms rely on fixed thresholds, often generalize poorly, and struggle to identify reflection regions reliably. For deep learning methods, progress is constrained by the scarcity of specular reflection datasets. Monkam et al. [8] used a hybrid strategy combining transfer learning and weakly supervised learning to train lightweight U-Net models with inaccurate labels, which nevertheless struggled with small reflective regions. Ali et al. [9] reported improved detection accuracy using a ResNet50 backbone with DeepLabv3+. As depicted in Fig. 1, the diverse forms, shapes, and sizes of specular reflections in endoscopic images pose unresolved challenges for current detection methods, leading to two predominant issues: over-segmentation (false positives) and under-segmentation (false negatives). These challenges are particularly pronounced when the appearance of the reflective region closely resembles that of the surrounding organ tissue.

Furthermore, owing to the substantial variations in texture and color across endoscopic images, an unresolved issue remains in the reflective region inpainting stage: accurately and efficiently inpainting larger specular reflective regions using both global and local image information. Arnold et al. [6] proposed an inpainting technique that first fills the reflective region from neighboring pixels and then applies a nonlinear decay along its edges. Meanwhile, PCA-based methods such as AdaRPCA [7] and AANC-RPCA [1] use the low-rank image obtained from matrix decomposition, either directly or with outward attenuation, as the inpainting result. Additionally, various traditional inpainting methods, including stochastic Bayesian estimation [10], specific Sobolev operators [4], and example-based methods [5], have been explored for handling reflective regions.

Recent studies have explored deep learning techniques for reflective region inpainting. Funke et al. [2], Ali et al. [9], and Daher et al. [11] employed generative adversarial network-based approaches. Monkam et al. [8] proposed the GatedResU-Net architecture, achieving more plausible inpainting results. However, these approaches still suffer from blurriness, a noticeable lack of texture, and an inability to blend seamlessly with the surrounding texture. Such limitations undermine the value of the inpainted results for downstream computer vision applications. Consequently, developing a specular reflection removal system that both accurately detects reflective regions and realistically inpaints them remains a formidable challenge.

In this paper, we tackle three key challenges: (1) the scarcity of datasets; (2) over- and under-segmentation in specular reflection detection; and (3) suboptimal inpainting of specular reflection regions. Because dataset labeling is time-consuming and laborious, we present a method for semi-automatically labeling endoscopic specular reflection regions, yielding a weakly labeled dataset. The proposed EndoSRR endoscopic specular reflection removal framework, depicted in Fig. 2, uses this dataset to fine-tune the adapted segment anything model (SAM). The resulting specular reflection masks serve as input to the resolution-robust large mask inpainting model (LaMa), which inpaints the specular reflection regions. Finally, we introduce a simple yet effective optimization strategy to further refine the reflection removal results. In both qualitative and quantitative comparisons, our method achieves the best results in specular reflection detection and in reflective region inpainting. Additionally, we apply the inpainting results directly to the segment anything model and visualize them in 3D using the depth information provided by the SCARED-2019 dataset. The experimental results highlight that effective endoscopic specular reflection removal not only benefits downstream tasks but also alleviates the visual fatigue experienced by surgeons during prolonged procedures.

The code is available at https://github.com/Tobyzai/EndoSRR.

Method

Creation of endoscopic specular reflection dataset

To fine-tune the SAM-adapter and acquire precise reflection masks, we constructed a weakly labeled endoscopic specular reflection dataset. As illustrated in Fig. 3, the process entails three main steps: (a) global k-means clustering of the image for an initial coarse filtering of reflective regions, (b) local k-means clustering to further refine the reflective regions, followed by a dilation operation to cover the more regular halos, and (c) manual outlining to refine irregular halos. We intend this dataset to advance the field of reflection removal.
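The global clustering step (a) can be sketched as follows. This is a minimal illustration assuming OpenCV's k-means on grayscale intensities; the cluster count `k`, the brightest-cluster heuristic, and the kernel size are our assumptions rather than parameters reported here:

```python
import cv2
import numpy as np

def coarse_reflection_mask(image_bgr: np.ndarray, k: int = 4) -> np.ndarray:
    """Coarsely filter specular regions by global k-means on pixel intensity.

    Pixels in the brightest cluster are treated as candidate reflections;
    a dilation then grows the mask to cover the more regular halos.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    samples = gray.reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, centers = cv2.kmeans(samples, k, None, criteria, 3,
                                    cv2.KMEANS_PP_CENTERS)
    brightest = int(np.argmax(centers))                # specular cluster
    mask = (labels.reshape(gray.shape) == brightest).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.dilate(mask, kernel)                    # cover regular halos
```

The local clustering step (b) would apply the same procedure per image patch before the manual refinement in step (c).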

Fig. 3: Process for creating the endoscopic specular reflection weakly labeled dataset.

SAM-adapter for reflection detection

How to effectively leverage the capabilities that SAM has acquired from massive training corpora for the specific task of specular reflection detection is an open question [12]. An efficient solution is to integrate explicit visual prompts into the SAM model [13, 14]. In this study, we employ SAM-adapter, fine-tuned on a purpose-built small dataset, to segment reflection regions. As illustrated in Fig. 4, SAM-adapter comprises four modules.

Fig. 4: SAM-adapter architecture, consisting of four modules: high-frequency components tune, patch embedding tune, adapter, and SAM.

Module-1: High-frequency components tune (first column, top left). Using the fast Fourier transform, the high-frequency component \(I_\textrm{hfc}\) is extracted from the image and split into small patches \(I^p_\textrm{hfc}\) matching SAM's patch format. To align with SAM's dimensions, each patch is projected to features \(F_\textrm{hfc}\) by a linear layer \({L_\textrm{hfc}}\). The primary objective of this module is to instill invariance to endoscopic image features in the pre-trained model, serving as a form of data augmentation. The process is defined as follows:

$$\begin{aligned} F_\textrm{hfc} = {L_\textrm{hfc}}(I^p_\textrm{hfc}). \end{aligned}$$
(1)
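A minimal PyTorch sketch of the high-frequency extraction, assuming the common formulation that zeroes a centered low-frequency band of the spectrum; the mask ratio `tau` is our assumption, as its value is not specified here:

```python
import torch

def high_frequency_component(image: torch.Tensor, tau: float = 0.25) -> torch.Tensor:
    """Extract I_hfc by masking out the centered low-frequency band of the FFT.

    image: (B, C, H, W); tau is the fraction of the spectrum treated as
    low frequency (assumed hyperparameter).
    """
    spectrum = torch.fft.fftshift(torch.fft.fft2(image, norm="ortho"))
    _, _, h, w = image.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(h * tau / 2), int(w * tau / 2)
    spectrum[..., cy - ry:cy + ry, cx - rx:cx + rx] = 0  # drop low frequencies
    return torch.fft.ifft2(torch.fft.ifftshift(spectrum), norm="ortho").real
```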

Module-2: Patch embedding tune (second column, top left). This module adjusts the pre-trained patch embedding, shifting its distribution from the pre-training dataset to the endoscopic specular reflection dataset. \(I^p\) denotes the frozen patch embedding output of SAM, which passes through a linear layer \({L_\textrm{pe}}\) and is projected onto the features \(F_\textrm{pe}\). The operation is defined as:

$$\begin{aligned} F_\textrm{pe} = {L_\textrm{pe}}(I^p). \end{aligned}$$
(2)

Module-3: Adapter (top right). Each adapter dynamically integrates the features \(F_\textrm{hfc}\) and \(F_\textrm{pe}\) using its layer-specific multilayer perceptron \({\mathrm {MLP^i_{tune}}}\), a \({\textrm{GELU}}\) activation [15], and a globally shared multilayer perceptron \({\mathrm {MLP_{up}}}\), and attaches the resulting visual prompt \(P^i\) to the corresponding transformer layer. For the i-th adapter, the process is defined as follows:

$$\begin{aligned} P^i = {\mathrm {MLP_{up}}}({\textrm{GELU}}({\mathrm {MLP^i_{tune}}}(F_\textrm{hfc} + F_\textrm{pe}))). \end{aligned}$$
(3)
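Equation (3) translates almost directly into code. The sketch below uses illustrative embedding and hidden dimensions; the actual sizes in SAM-adapter may differ:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """i-th adapter: P^i = MLP_up(GELU(MLP^i_tune(F_hfc + F_pe))), Eq. (3)."""

    def __init__(self, embed_dim: int, mid_dim: int, mlp_up: nn.Module):
        super().__init__()
        self.mlp_tune = nn.Linear(embed_dim, mid_dim)  # layer-specific MLP^i_tune
        self.act = nn.GELU()
        self.mlp_up = mlp_up                           # globally shared MLP_up

    def forward(self, f_hfc: torch.Tensor, f_pe: torch.Tensor) -> torch.Tensor:
        return self.mlp_up(self.act(self.mlp_tune(f_hfc + f_pe)))

# One shared up-projection is reused by every adapter (illustrative sizes):
mlp_up = nn.Linear(32, 768)
adapters = nn.ModuleList([Adapter(768, 32, mlp_up) for _ in range(12)])
```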

Module-4: SAM (bottom). SAM [13] serves as the backbone of the endoscopic specular reflection segmentation network, comprising an encoder and a decoder. The encoder remains frozen, with each of its layers receiving the visual prompts \(P^i\) from the adapters. In contrast, the decoder receives no prompt information.

Through the interplay of these four modules, task-specific knowledge is integrated with the general knowledge acquired during pre-training, enhancing the utility of SAM for the specular reflection detection task. The specular reflection segmentation results are detailed in the “Reflection detection” section.

LaMa for reflective region inpainting

After endoscopic specular reflections are detected, the next step is to inpaint the reflective regions using LaMa, a state-of-the-art inpainting technique [16]. As illustrated in Fig. 5, the LaMa pipeline is described by the following equations:

$$\begin{aligned} i' = {\text {stack}}(i, m), \end{aligned}$$
(4)
$$\begin{aligned} {\hat{i}} = f_{\theta }(i'), \end{aligned}$$
(5)
$$\begin{aligned} {\mathcal {L}}_\textrm{final} = \kappa {\mathcal {L}}_\textrm{Adv} + \alpha {\mathcal {L}}_\textrm{HRFPL} + \beta {\mathcal {L}}_\textrm{DiscPL} + \gamma R_{1}. \end{aligned}$$
(6)
Fig. 5: LaMa architecture: the mask and original image are stacked as input to obtain a reflection-free image.

Initially, a 3-channel endoscopic image i and a 1-channel reflection mask m are stacked to form a 4-channel input \(i'\). Subsequently, the feed-forward inpainting network \(f_{\theta }(\cdot )\), which comprises downscaling, fast Fourier convolution (FFC) blocks [17], and upscaling, processes the input \(i'\) in a fully convolutional manner, yielding the inpainted 3-channel image \({\hat{i}}\). Finally, the network parameters \(\theta \) are optimized with the \({\mathcal {L}}_\textrm{final}\) loss. Here, \({\mathcal {L}}_\textrm{Adv}\) and \({\mathcal {L}}_\textrm{DiscPL}\) promote natural-looking local details, \({\mathcal {L}}_\textrm{HRFPL}\) supervises the global structure and signal consistency, \(R_{1}\) is the gradient penalty, and \(\kappa \), \(\alpha \), \(\beta \), \(\gamma \) are the weights. Additional details on the LaMa inpainting model can be found in [16].
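At inference time, Eqs. (4)-(5) amount to a channel-wise stack and a forward pass. The sketch below assumes a pre-trained LaMa generator `f_theta` is already loaded and follows the usual LaMa convention of zeroing masked pixels before stacking; the final compositing step is our illustration:

```python
import torch

@torch.no_grad()
def inpaint_reflections(f_theta: torch.nn.Module,
                        image: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Eqs. (4)-(5): i' = stack(i, m); i_hat = f_theta(i').

    image: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W), 1 on reflections.
    """
    masked = image * (1.0 - mask)            # hide reflective pixels
    x = torch.cat([masked, mask], dim=1)     # 4-channel input i'
    inpainted = f_theta(x)                   # predicted reflection-free image
    # keep known pixels; take the prediction only inside the mask
    return image * (1.0 - mask) + inpainted * mask
```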

Given the absence of reflection-free endoscopic images, we address this challenge through full transfer learning. Experimental findings show that the pre-trained model inpaints even large reflective regions effectively. Detailed inpainting results are presented in the “Reflective region inpainting” section.

Optimization strategy

To address the limitations of weakly labeled datasets, we introduce a dual pre-trained model iterative optimization (DPMIO) strategy. The optimization process, detailed in the pseudo-code of Fig. 6, begins with the original image: SAM-adapter produces a reflection mask, from which LaMa generates an inpainted image. The inpainted image is then fed back into SAM-adapter for mask refinement and into LaMa for updated inpainting, iterating until the specified criteria are met. The strategy employs the parameters \(\mu \) (1.5e\(-\)4) and iter (5).
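A compact sketch of the loop under our reading of Fig. 6; the exact role of \(\mu \) as a threshold on the remaining reflective area ratio is our assumption:

```python
def dpmio(image, detect, inpaint, mu: float = 1.5e-4, max_iter: int = 5):
    """Dual pre-trained model iterative optimization (DPMIO).

    detect: SAM-adapter call returning a binary reflection mask.
    inpaint: LaMa call returning an inpainted image for (image, mask).
    Iterates until the remaining reflective area ratio drops below mu
    (assumed stopping rule) or max_iter iterations are reached.
    """
    current = image
    for _ in range(max_iter):
        mask = detect(current)              # refine the reflection mask
        if mask.float().mean() < mu:        # few reflective pixels remain
            break
        current = inpaint(current, mask)    # update the inpainting
    return current
```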

Fig. 6: Optimization process for the reflection removal result.

Despite its simplicity, this optimization strategy proves highly effective in refining the inpainting results. As illustrated in Fig. 6, it progressively detects challenging reflection regions, such as smaller or lighter reflections, leading to a more natural and plausible inpainting outcome. The ablation experiment is presented in the “Reflective region inpainting” section.

Experiments

Implementation

The proposed method, EndoSRR, was implemented in PyTorch on an NVIDIA RTX 3090 GPU. The algorithm of Arnold et al. [6] was implemented in MATLAB R2021b on a system with a 3.20 GHz AMD Ryzen 7 6800H processor, while the AdaRPCA [7] and AANC-RPCA [1] algorithms were implemented in C++ with Qt Creator 10.0.2. Notably, apart from these three endoscopic specular reflection removal algorithms, no other algorithms have publicly released implementations.

In the reflection detection stage, all modules of SAM-adapter are tunable except the SAM encoder, which remains frozen, as illustrated in Fig. 4. ViT-B provided the pre-trained weights, AdamW was the optimizer, and binary cross-entropy (BCE) and IoU losses were the loss functions; the learning rate was 2e-4 and the model was fine-tuned for 300 epochs. The Big-LaMa pre-trained model was employed for inpainting the reflection region, with \(\kappa \), \(\alpha \), \(\beta \), and \(\gamma \) set to 10, 30, 100, and 0.001, respectively.
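The fine-tuning loop therefore has roughly the following shape; a sketch in which `model` and `train_loader` are assumed placeholders and the soft-IoU formulation is our assumption about the IoU loss:

```python
import torch
import torch.nn as nn

def iou_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Soft IoU loss on sigmoid probabilities (assumed formulation)."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2, 3))
    union = (p + target - p * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

bce = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-4)

for epoch in range(300):
    for images, masks in train_loader:      # 1024 x 1024 inputs and labels
        logits = model(images)              # SAM-adapter forward pass
        loss = bce(logits, masks) + iou_loss(logits, masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```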

Datasets

We annotated the specular reflection dataset using all keyframes of the SCARED-2019 dataset [18]. Datasets 1-7 constitute the training set (70 images), while datasets 8 and 9 form the testing set (20 images). In the reflection detection stage, images are resized from \(1280 \times 1024\) to \(1024 \times 1024\) to match the SAM model; in the inpainting stage, the original image size is retained.

Results and comparisons

Reflection detection

Quantitative comparison

Table 1 demonstrates that the proposed reflection detection method surpasses the other methods across the segmentation evaluation metrics, including accuracy, IoU, precision, and E-measure. The E-measure is defined as \(\text {E-measure}=\frac{1}{w\times h}\sum _{x=1}^{w}\sum _{y=1}^{h}\phi _{FM}(x,y)\), where \(\phi _{FM}\) is the enhanced alignment matrix and h and w are the height and width of the map, respectively; further details can be found in [19]. Compared with the state-of-the-art method AANC-RPCA [1], the proposed method achieves an IoU of 0.5888 versus 0.5223. Because IoU penalizes both false negatives and false positives, this indicates that the proposed method effectively addresses the challenges of over- and under-segmentation, yielding more accurate segmentation outcomes.
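For reference, these metrics can be computed as below; a sketch in which the enhanced alignment matrix follows our reading of the definition in [19]:

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Accuracy, IoU, precision, and E-measure for binary maps in {0, 1}."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    tn = np.logical_and(pred == 0, gt == 0).sum()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)

    # E-measure: mean of the enhanced alignment matrix phi_FM [19]
    d_gt = gt - gt.mean()                     # bias (de-meaned) matrices
    d_pred = pred - pred.mean()
    align = 2 * d_gt * d_pred / (d_gt ** 2 + d_pred ** 2 + eps)
    phi = (align + 1) ** 2 / 4                # enhanced alignment matrix
    return accuracy, iou, precision, phi.mean()
```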

Table 1 Quantitative comparison of reflection detection

Qualitative comparison

As depicted in Fig. 7, the method proposed by Arnold et al. [6] excels in detecting smaller reflection regions but struggles with larger reflection regions, contributing to an under-segmentation problem. Conversely, AdaRPCA [7] and AANC-RPCA [1] tend to misclassify lighter-colored organ tissues as reflections, resulting in an over-segmentation problem. In contrast, the proposed method demonstrates a balanced approach, mitigating the challenges of both under- and over-segmentation.

Fig. 7: Qualitative comparison of reflection detection between the proposed method and Arnold et al. [6], AdaRPCA [7], and AANC-RPCA [1].

Table 2 Quantitative comparison of inpainting results with reference assessment metrics
Table 3 Quantitative comparison of inpainting results with non-reference assessment metric

Reflective region inpainting

Quantitative comparison

For a meaningful quantitative comparison, each method inpaints identical non-reflective regions, and the results are assessed against the original image using peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and mean squared error (MSE). Table 2 shows that the proposed method significantly outperforms the other methods across all evaluation metrics. This superiority is attributed to the strong generalization ability of the model, which combines local and global information to optimally restore the missing content of the image.
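A sketch of this reference-based evaluation using scikit-image; restricting the comparison to the synthetically masked non-reflective regions, as described above, is assumed to happen upstream:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reference_metrics(original: np.ndarray, inpainted: np.ndarray):
    """PSNR, SSIM, and MSE of an inpainted result against the original.

    Both images: (H, W, 3) uint8. Because only non-reflective regions are
    synthetically masked and inpainted, the original serves as reference.
    """
    psnr = peak_signal_noise_ratio(original, inpainted, data_range=255)
    ssim = structural_similarity(original, inpainted,
                                 channel_axis=-1, data_range=255)
    mse = np.mean((original.astype(np.float64)
                   - inpainted.astype(np.float64)) ** 2)
    return psnr, ssim, mse
```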

We further evaluated reflective region inpainting using the blind image inpainting quality assessment metric (BIIQA) [20], which emphasizes local feature continuity. BIIQA is defined as \({\textrm{BIIQA}} = \alpha {\bar{Q}}_{e} + \beta {\bar{Q}}_{t} + \gamma {\bar{Q}}_{s}\), where \(\alpha \), \(\beta \), and \(\gamma \) represent the percentages of edge, texture, and smooth patches, respectively, and \({\bar{Q}}_{e}\), \({\bar{Q}}_{t}\), and \({\bar{Q}}_{s}\) denote the mean edge, texture, and smooth scores, respectively. Table 3 confirms our method's superior performance. In the ablation experiments, the proposed optimization strategy (DPMIO) improves inpainting quality (BIIQA: 0.686 to 0.693) but extends the runtime (0.92 s to 2.61 s). While real-time capability remains a limitation, our method is still the fastest.

Qualitative comparison

For the qualitative comparison, all methods inpaint the same specified reflection regions. As shown in Fig. 8, the interpolation-based method [6] and the low-rank decomposition methods [1, 7] inpaint the reflection region ineffectively: their inpainting traces are conspicuous and the results appear blurred. In contrast, the proposed method yields results that blend seamlessly with the background texture, presenting a more natural appearance while minimizing the loss of organ texture information. As the final row of Fig. 8 shows, however, our method struggles with subtle, faintly reflective regions, a notable challenge for future improvement.

Fig. 8: Qualitative comparison of reflective region inpainting between the proposed method and Arnold et al. [6], AdaRPCA [7], and AANC-RPCA [1].

Fig. 9: Endoscopic specular reflection removal for segmentation and 3D visualization.

Application of specular reflection removal

Reasonable and natural specular reflection removal benefits downstream tasks. As shown in Fig. 9, it enhances the segmentation accuracy of SAM across diverse tissue regions and improves 3D visualization, supporting VR- and AR-based surgical navigation systems.

Discussion and conclusion

In this paper, we introduce EndoSRR, a novel algorithm for endoscopic specular reflection removal empowered by large-scale models. Our approach begins with the creation of a weakly labeled dataset using a semi-automatic labeling tool. Fine-tuning the SAM-adapter on this dataset then enables accurate detection of reflective regions, which are inpainted and refined by combining the state-of-the-art inpainting technique LaMa with the proposed optimization strategy. Through segmentation and 3D visualization applications, we validate the benefits of effective reflection removal for downstream tasks and for mitigating intraoperative visual fatigue. Our contributions include:

  • Creation of a weakly labeled dataset: We introduce a weakly labeled dataset that addresses the scarcity of endoscopic specular reflection datasets and is poised to advance deep learning research on specular reflection removal.

  • Optimization strategy: We propose a simple yet effective optimization strategy that enhances the naturalness and texture realism of the reflection removal results and can be applied to similar tasks.

  • Large models and transfer learning: Our work pioneers specular reflection removal on a small-scale dataset by leveraging large pre-trained models and transfer learning, yielding significantly improved removal results. This approach is particularly informative for the data-starved medical field.

Despite achieving superior results in both reflection detection and reflection region inpainting compared to state-of-the-art methods, our proposed EndoSRR method has certain limitations:

  • Color and texture restoration: The algorithm struggles to fully restore the real color and texture information of the image, a challenge shared by existing methods in the field.

  • Limited weakly labeled datasets: Owing to the complexity and scattered distribution of endoscopic specular reflections, our weakly labeled dataset is limited in size. Further improvements in segmentation results and more rigorous quantitative evaluation are needed.

  • Real-time performance: The algorithm does not achieve real-time performance, necessitating optimization and enhancements for practical use.

In conclusion, endoscopic specular reflection removal remains a formidable challenge. This work aims to contribute to the ongoing development of the field, ultimately enhancing the performance of downstream computer vision tasks and advancing the safety of surgical procedures. In future work, we aim to expand the dataset of reflection masks to enable precise localization of reflection regions in every video frame. This will allow temporal information across frames to be exploited, improving the restoration of the authentic texture and color details of the reflection regions.