Introduction

Lung diseases seriously affect human health. Lung cancer, for example, is the leading cause of cancer-related deaths worldwide [1]. Video-assisted thoracic surgery (VATS) is a reliable, precise, and safe minimally invasive treatment for lung cancer, in which doctors use a single-lens scope to observe the patient's condition and obtain visual information during surgery [2, 3]. However, VATS also has disadvantages, such as limited visibility and the inability to accurately localize the scope. Augmented reality navigation systems based on computer vision can help doctors address these issues. Still, due to problems such as changing illumination and sparse features, accurately and densely reconstructing lung structures is not a straightforward task.

Three-dimensional reconstruction from monocular video has been a long-standing research topic [4,5,6]. Currently, deep learning methods are the primary research direction for this problem. Eigen et al. [7], Xu et al. [8], Cao et al. [9], and Fu et al. [10] have used fully supervised convolutional neural networks for depth estimation. However, fully supervised three-dimensional reconstruction is challenging for endoscopy, since obtaining true depth maps corresponding to endoscopy images is difficult.

Therefore, self-supervised monocular depth estimation and pose estimation have more research value (Luo et al. [11], Ranjan et al. [12], Casser et al. [13]). Self-supervised methods simultaneously estimate scene depth and camera pose and use the results to synthesize frames by warping. The difference between the target frame and the synthesized frame is then used as the training supervision signal. However, network structures designed for general scenes are not suitable for endoscopy, because the frame-to-frame photometric consistency assumption does not always hold in endoscopy videos.

Several methods have been developed to address the issue of inconsistent lighting. Liu et al. [14] first used a multi-view synthesis method to generate sparse depth and camera poses and then combined them to supervise the depth network. Spencer et al. [15] used learned dense visual representations to strengthen the supervision signal when the photometric consistency condition fails. Yang et al. [16] and Ozyoruk et al. [17] used bio-inspired transformers to map the target frame into the same brightness space as the synthesized frame. Recasens et al. [18] combined traditional methods with deep learning, using photometric inconsistencies to track camera poses. However, these methods have drawbacks, such as high computational complexity, heavy reliance on learned visual representations, and the inability to handle extreme lighting changes.

In this work, we establish a new monocular thoracoscopic three-dimensional reconstruction framework. First, we address the issue of inconsistent lighting by using optical flow between adjacent frames. The optical flow introduces a generalized dynamic image constraint (GDIC), which includes both geometric and radiometric transformations. These two transformations help increase inter-frame information and compensate for inter-frame brightness differences. Second, to handle changes in the appearance of lung tissue during thoracoscope movement, we add an attention module to the depth estimation network, allowing it to focus on regions with relatively rich texture information. Finally, we introduce inter-layer losses between different network layers to prevent gradient vanishing caused by deep convolutional stacks. By supervising intermediate layers, we adequately train shallow convolutional layers and reduce underfitting in low-texture regions. We use clinical data collected in collaboration with a hospital to validate the model's accuracy, demonstrating that the model can provide doctors with more accurate patient tissue location information during surgery (Fig. 1).

Fig. 1 The architecture of the proposed self-supervised 3D reconstruction system

Related work

Fully supervised depth estimation

Deep convolutional networks were first proposed for depth estimation in [19]. Early depth estimation models were trained in a supervised manner using depth sensors. Eigen et al. [7] proposed a multi-scale network with a scale-invariant loss to regress depth from a single static image. Laina et al. [20] introduced a residual fully convolutional network (FCN) architecture for monocular depth estimation, which was deeper and eliminated post-processing steps. Cao et al. [9] treated depth estimation as a pixel-level classification problem and trained a residual network to predict the category of each pixel after discretizing the depth values. Xu et al. [8] used a Conditional Random Field (CRF) as a depth post-processing module. Fu et al. [10] treated depth estimation as a classification problem and introduced more robust losses.

However, the endoscopy environment differs from the external environment, and abundant, accurate RGB-D data for fully supervised training are difficult to obtain. Using computer-synthesized data has become one approach to address this. Visentini-Scarzanella et al. [21] used CT data and background-free simulated endoscopy videos to train fully supervised deep networks. Chen et al. [22] used color images and rendered depth maps to train a fully supervised depth network. Yang et al. [23] simulated the endoscopic imaging process using 3D modeling and rendering tools to achieve full supervision. However, simply mimicking appearance is not enough to close the gap between the synthetic and real domains, which may result in a performance drop.

Self-supervised depth and ego-motion estimation

Self-supervised networks train indirectly through differences between images, such as pixel, predicted depth, and appearance differences across image sequences, thus avoiding the use of ground-truth depth maps. Initially, self-supervised networks were based on multi-view images. Xie et al. [24] introduced a model with discrete depth for synthesizing views. Garg et al. [25] further investigated predicting continuous disparity values, and Godard et al. [26] improved results by adding a left-right depth consistency term. Various improvements to multi-view methods include semi-supervised data [27, 28], generative adversarial networks [29, 30], additional consistency terms [31], temporal information [32,33,34], and real-time usage [35].

For self-supervised estimation from monocular images, researchers have made various improvements to network structures, loss functions, and more. In addition to predicting depth, self-supervised monocular training also requires the network to estimate endoscope poses between frames, which can be challenging when objects move. Zhou et al. [36] developed a self-supervised framework that views depth estimation as a warping-based view synthesis task. However, self-supervised frameworks designed for general environments struggle with issues such as inter-frame brightness inconsistency when applied to endoscopy environments.

Turan et al. [37] introduced research on self-supervised depth and ego-motion estimation in endoscopy scenes. Liu et al. [14] used sparse depth and camera poses generated by a traditional SfM pipeline as supervision, with SfM running as a preprocessing step. Li et al. [38] used Peak Signal-to-Noise Ratio (PSNR) as an additional optimization objective during training. Ozyoruk et al. [17] employed bio-inspired brightness transformers to enhance photometric robustness.

Compared to previous methods, we use optical flow to constrain image photometry and employ attention modules and inter-layer losses to handle non-Lambertian reflection and inter-reflection caused by illumination changes inside the lung. Based on these improvements, we establish a comprehensive self-supervised framework. Our method is direct and does not require additional auxiliary information, such as CT images or depth maps generated by structured light, nor does it require multi-view images.

Methodology

In this section, we first introduce the prior knowledge of monocular 3D reconstruction. We then elaborate the proposed optical flow-based 3D reconstruction framework, which consists of three parts: A. a depth estimation network with an added attention mechanism, B. a motion estimation network based on optical flow, and C. the loss function. The overall framework is trained in a self-supervised manner and can perform accurate 3D reconstruction of endoscopic scenes.

Self-supervised 3D reconstruction

Self-supervised 3D reconstruction involves two sub-networks: the depth estimation network and the pose estimation network. Unlike fully supervised methods, which use real depth and pose as supervision signals, the supervision signal in self-supervised methods comes from warping-based view synthesis. First, the depth estimation network estimates per-pixel depth values of the current frame. Then, using the endoscope's intrinsic parameters, the pixels on the 2D plane are projected back into 3D camera space. The pose estimation network is then used to project the 3D point cloud onto adjacent frames. Given two frames \(I^t(p)\) and \(I^s(p)\), the frame transformation relationship is:

$$\begin{aligned} h(p^{s \rightarrow t})=[K\vert 0]M^{s \rightarrow t}\left[ \begin{array}{c} D^tK^{-1}h(p^t)\\ 1 \end{array} \right] \end{aligned}$$
(1)

where \(h(p^{s \rightarrow t})\) and \(h(p^t)\) are the homogeneous coordinates of the corresponding pixels on the source frame s and the target frame t, respectively, K represents the camera intrinsic parameters, \(M^{s \rightarrow t}\) represents the motion from the source frame to the target frame, and \(D^t\) represents the depth map of the target frame. From the above equation, the rigid flow from the target frame to the source frame is obtained as:

$$\begin{aligned} F_\delta ^{t \rightarrow s}(p)=p^{s \rightarrow t}-p^t \end{aligned}$$
(2)
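For concreteness, the following is a minimal PyTorch sketch of this warping-based view synthesis step (Eqs. 1–2), assuming batched tensors and a pinhole intrinsic matrix; the function and variable names are ours for illustration and are not taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def synthesize_target(src, depth_t, K, K_inv, T):
    """Sketch of warping-based view synthesis (Eqs. 1-2).

    src     : (B, 3, H, W) source frame I^s
    depth_t : (B, 1, H, W) predicted depth D^t of the target frame
    K, K_inv: (B, 3, 3)    camera intrinsics and their inverse
    T       : (B, 4, 4)    relative pose M^{s->t} from the pose network
    Returns the synthesized frame I^{s->t} sampled from the source frame.
    """
    B, _, H, W = src.shape
    device, dtype = src.device, src.dtype

    # Homogeneous pixel grid h(p^t), shape (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=dtype),
                            torch.arange(W, device=device, dtype=dtype),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project to camera space: D^t * K^{-1} h(p^t), then make homogeneous
    cam = depth_t.view(B, 1, -1) * (K_inv @ pix)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device, dtype=dtype)], dim=1)

    # Rigid transform and projection: [K|0] M^{s->t} [...]
    proj = K @ (T @ cam_h)[:, :3, :]
    uv = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)

    # Normalize coordinates to [-1, 1] and sample the source frame
    u = 2.0 * uv[:, 0, :] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src, grid, padding_mode="border", align_corners=True)
```

The rigid flow of Eq. (2) is simply the difference between the projected coordinates `uv` and the original pixel grid.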

Depth estimation network

The depth estimation network (DepthNet) consists of an encoder and a decoder; it takes the original frame \(I_s\) as input and outputs the corresponding disparity map \(D_s\). The network takes a 3-channel RGB image with a resolution of 320\(\times\)256 as input and produces an output at the same resolution. The overall architecture of the network is illustrated in Fig. 2.

Structure of network

The initialization block of the encoder consists of three parts: a 3\(\times\)3 convolutional layer with 64 filters (C3\(\times\)3), a batch normalization layer (BN), and a leaky rectified linear unit activation (LeakyReLU) with a negative slope of 0.01. After initialization, the features pass through the spatial attention module (SAM); the details of the attention module are introduced in the next section. They then pass through a max-pooling layer (MP) and finally through four ResNet basic blocks, each of which consists of C3\(\times\)3, BN, ReLU, C3\(\times\)3, BN, ReLU, and a skip connection in turn.

The decoder consists of four basic blocks, each consisting of C3\(\times\)3, exponential linear unit (ELU), C3\(\times\)3, and ELU in turn.

The final output layer consists of two interleaved C3\(\times\)3 and ELU pairs, followed by a Sigmoid activation. To establish information flow between the encoder and decoder, skip connections are added from layer \(i\) to layer \(n-i\), where \(n\) denotes the total number of layers and \(i\in \{0,1,2,3\}\).
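As an illustration, below is a hedged PyTorch sketch of the encoder initialization block and one decoder basic block; we read the 0.01-slope activation as a leaky ReLU, and channel counts other than the stated 64 filters are assumptions.

```python
import torch.nn as nn

class InitBlock(nn.Module):
    """Sketch of the encoder initialization block: C3x3(64) -> BN -> LeakyReLU(0.01).
    Reading the '0.01-slope' activation as a leaky ReLU is our assumption."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.01, inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class DecoderBlock(nn.Module):
    """Sketch of one decoder basic block: C3x3 -> ELU -> C3x3 -> ELU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ELU(),
        )

    def forward(self, x):
        return self.block(x)
```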

Fig. 2 Structure of the depth estimation network

Spatial attention module

The spatial attention module guides the depth estimation network by emphasizing texture details in regions with depth variation. It selects specific regions of the input and processes the features within them. The module operates as a non-local convolution process: for any given input \(X\in R^{N \times C \times H \times W}\), it computes

$$\begin{aligned} Z=f(X,X^T)g(X) \end{aligned}$$
(3)

where f computes the pairwise relationship between all positions of the input X, and g is a learned embedding of the input. The non-local operator extracts the relative weights of all positions on the feature map.

In this module, the input is embedded by the \(\theta\) and \(\phi\) convolutions (with max-pooling), their dot product is taken, and the result is activated by the ReLU function:

$$\begin{aligned} P=\psi (\sigma _{relu}(\theta (X)\phi (X)^T)) \end{aligned}$$
(4)

where \(\sigma _{relu}\) is the ReLU activation function. The dot product \(\theta (X)\phi (X)^T\) measures the input covariance, i.e., how strongly feature maps from different channels vary together. The output P is passed through a softmax function and matrix-multiplied with \(g(X)\); the result is then convolved with \(\phi\) and upsampled to extract the attention map S. Finally, an element-wise sum between the attention map S and the input X generates the output \(F\in R^{N\times C\times H\times W}\).

$$\begin{aligned} S=\phi (\sigma _{softmax}(P)\,g(X)) \end{aligned}$$
(5)
$$\begin{aligned} F=S+X \end{aligned}$$
(6)

where \(\sigma _{softmax}\) denotes the softmax function. The shortcut connection between the input X and the output F completes the residual learning of the block.
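The following is a simplified, hedged PyTorch sketch of such a non-local spatial attention block. Channel sizes and the pooling factor are assumptions, and the intermediate \(\psi\)/ReLU step is folded into the 1\(\times\)1 embeddings for brevity; only the overall pattern (position-wise affinities re-weighting embedded features, plus the residual sum \(F=S+X\)) follows the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Simplified sketch of the spatial attention module (Eqs. 3-6).

    theta/phi/g are 1x1 embedding convolutions; phi and g are max-pooled to
    reduce cost, affinities are dot products over positions, and the attended
    features are projected back and added to the input (F = S + X).
    """
    def __init__(self, channels, reduction=2, pool=2):
        super().__init__()
        inter = max(channels // reduction, 1)
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.out = nn.Conv2d(inter, channels, kernel_size=1)
        self.pool = nn.MaxPool2d(pool)

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2)                      # (B, C', H*W)
        k = self.pool(self.phi(x)).flatten(2)             # (B, C', H*W/p^2)
        v = self.pool(self.g(x)).flatten(2)               # (B, C', H*W/p^2)

        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)       # (B, HW, HW/p^2)
        s = (attn @ v.transpose(1, 2)).transpose(1, 2)            # (B, C', H*W)
        s = self.out(s.reshape(B, -1, H, W))                      # project back to C channels

        return x + s                                              # residual: F = S + X
```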

Pose estimation network

Our pose estimation network is primarily based on the design by Shao et al. [39], and it consists of three main components: a motion module, an appearance module, and a correspondence module. The motion module serves as a 6-degree-of-freedom (6DOF) self-motion estimator, taking two consecutive frames as input and outputting a relative pose parameterized by Euler angles and a translation vector. The appearance module is used to predict appearance flow and adjusts brightness conditions through a brightness calibration process. The correspondence module handles the automatic registration step.

The encoder of the motion module network is similar to the one described in Sect. 3.2.1. It begins with an initialization block, but note that there is no attention module at this stage; the effect of the attention module is discussed in Sect. 4. The encoder then passes through a max-pooling layer, followed by four ResNet basic blocks. The decoder consists of three basic blocks and one C3\(\times\)3, with each basic block composed of a C3\(\times\)3 followed by a ReLU.

The appearance module network has a structure similar to the depth network. During the encoding phase, a concatenated image pair passes through convolution block layers with a stride of 2, forming a five-level feature pyramid. Skip connections then propagate the pyramid's features to the decoding phase. In the decoding phase, upsampling layers, concatenated feature maps, 3\(\times\)3 convolution layers with ELU activation, and the estimation layer are sequentially connected until the network's output reaches the highest resolution. Apart from the estimation layer, the correspondence module network maintains the same architecture as the appearance module network.

Loss function

The loss function consists of three parts: the residual-based smoothness loss, the auxiliary loss, and the edge-aware smoothness loss. To fully utilize information across different levels, we also introduce inter-layer losses when calculating the loss. The single-layer loss function is as follows:

$$\begin{aligned} L=\lambda _1l_{rs}+\lambda _2l_{ax}+\lambda _3l_{es} \end{aligned}$$
(7)

Smoothness loss based on residuals

The residual-based smoothness loss penalizes first-order gradients. It is computed from the output of the appearance module network together with the original image, as follows:

$$\begin{aligned} l_{rs}=\sum _p\vert \nabla A_\delta (p)\vert \end{aligned}$$
(8)

where \(A_\delta (p)\) represents the constraint of the appearance module on the light intensity. Additionally, a residual-based gradient is used to emphasize regions with sharp brightness changes:

$$\begin{aligned} l_{rs}=\sum _p\vert \nabla A_\delta (p)\vert \times e^{-\nabla \vert I^t(p)-I^{s\rightarrow t}(p)\vert } \end{aligned}$$
(9)
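A minimal sketch of this residual-based smoothness term, assuming (B, C, H, W) tensors for the appearance output \(A_\delta\) and for the two frames; the exact tensor layouts and the channel averaging are assumptions.

```python
import torch

def residual_smoothness_loss(appearance_flow, target, synthesized):
    """Sketch of the residual-based smoothness loss (Eqs. 8-9).

    appearance_flow : (B, C, H, W) output A_delta of the appearance module
    target, synthesized : (B, 3, H, W) frames I^t and I^{s->t}
    The exponential term relaxes the smoothness penalty where the photometric
    residual itself changes sharply.
    """
    def grad_xy(t):
        dx = (t[..., :, 1:] - t[..., :, :-1]).abs().mean(1, keepdim=True)
        dy = (t[..., 1:, :] - t[..., :-1, :]).abs().mean(1, keepdim=True)
        return dx, dy

    ax, ay = grad_xy(appearance_flow)
    rx, ry = grad_xy((target - synthesized).abs())

    return (ax * torch.exp(-rx)).mean() + (ay * torch.exp(-ry)).mean()
```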

Auxiliary loss

\(l_{ax}\) provides the supervisory signal for the appearance module:

$$\begin{aligned} l_{ax}=\sum _pM(p)\times \Phi (I^{s\rightarrow t}(p),I^t(p)+A_\delta (p)) \end{aligned}$$
(10)

where \(I^{s\rightarrow t}(p)\) is reconstructed from the optical flow via a spatial transformer, and \(M(p)\) represents the mask for pixels that fall within the visible range.

Edge-aware smoothness loss

The smoothness property of the depth map is enforced using \(l_{es}\) with the following equation:

$$\begin{aligned} l_{es}=\sum _p\vert \nabla D(p)\vert \times e^{-\nabla \vert I^t(p)\vert } \end{aligned}$$
(11)
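A corresponding sketch of the edge-aware smoothness term, under the same tensor-layout assumptions as above:

```python
import torch

def edge_aware_smoothness_loss(disp, target):
    """Sketch of the edge-aware smoothness loss (Eq. 11): depth gradients are
    penalized less across image edges of the target frame I^t."""
    dx = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy = (disp[..., 1:, :] - disp[..., :-1, :]).abs()

    ix = (target[..., :, 1:] - target[..., :, :-1]).abs().mean(1, keepdim=True)
    iy = (target[..., 1:, :] - target[..., :-1, :]).abs().mean(1, keepdim=True)

    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```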

Inter-layer loss

Due to the encoding and decoding processes in the network, different levels focus on different image ranges. If only the output of the last decoder layer is used to compute the loss, some local information may be lost. We therefore introduce an inter-layer loss by adding additional branches in the decoder to compute the loss at each layer. The structure of the inter-layer loss is shown in Fig. 2. With the inter-layer loss, the total loss function becomes the following, where \(k_i\) is the weight of the \(i\)-th layer and \(n\) is the number of decoder blocks:

$$\begin{aligned} L=\sum _{i=1}^nk_i(\lambda _1l_{rs}+\lambda _2l_{ax}+\lambda _3l_{es}) \end{aligned}$$
(12)
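A minimal sketch of assembling this inter-layer loss, assuming the three single-layer terms have already been computed from each decoder layer's (upsampled) prediction; the default weights follow the training settings reported in the next section.

```python
def total_loss(per_layer_terms, k=0.2, lambdas=(0.1, 0.01, 0.001)):
    """Sketch of the inter-layer total loss (Eq. 12).

    per_layer_terms : list of (l_rs, l_ax, l_es) tuples, one per decoder layer.
    With n = 5 decoder blocks and k_i = 0.2, every layer contributes equally;
    the function and argument names are ours for illustration.
    """
    l1, l2, l3 = lambdas
    return sum(k * (l1 * l_rs + l2 * l_ax + l3 * l_es)
               for (l_rs, l_ax, l_es) in per_layer_terms)
```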

Experiments

To evaluate the depth estimation accuracy of the proposed framework and to investigate different design considerations, we conduct extensive experiments in this section.

Dataset

  • SCARED [40]. The SCARED dataset was collected from fresh porcine cadavers during abdominal dissection and contains 35 endoscopic videos together with ground-truth depth and pose information.

  • Clinical data set. In collaboration with the hospital, we recorded thoracoscopic video during lung surgery using an Olympus endoscope with a built-in video recording function; the surgeon was asked to image the patient's thoracic cavity from as many angles as possible before the resection. The data set includes endoscopic video from 11 complete surgeries.

For depth estimation and pose estimation, we conducted extensive experiments on the clinical dataset. Because ground-truth depth and endoscope motion paths are unavailable for the clinical data, we use normalized local cross-correlation as a quantitative evaluation metric, detailed in Sect. 4.2. To demonstrate the generalization capability of the model, we then apply the model trained on the clinical dataset, without any adjustment, directly to the SCARED dataset.

Training parameter

Training: Our framework is implemented in PyTorch and trained on a single NVIDIA RTX 3060 GPU. We use the Adam optimizer with \(\beta _1=0.9\), \(\beta _2=0.99\), a batch size of 4, \(\alpha =0.85\), \(\lambda _1=0.1\), \(\lambda _2=0.01\), and \(\lambda _3=0.001\). For the inter-layer losses, the number of decoder blocks is \(n=5\) and the loss factor is \(k_i=0.2\) for each layer. We employ a ResNet-18 encoder pre-trained on ImageNet. The input image resolution for the entire network is 320\(\times\)256. Each epoch is divided into two stages. First, we train the correspondence module network using the edge-aware smoothness loss. After backpropagation and parameter updates, we then train the depth network, motion module network, and appearance module network. A total of 20 epochs are trained. In both stages, the initial learning rate is set to 1e-4 and is scaled by a factor of 0.1 after 10 epochs.
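A minimal sketch of this optimizer and learning-rate schedule; the two training stages are only indicated by a comment, and the placeholder modules stand in for the depth and pose networks.

```python
import torch
import torch.nn as nn

def build_optimizer(modules, lr=1e-4, betas=(0.9, 0.99)):
    """Adam with beta1=0.9, beta2=0.99, initial lr 1e-4, decayed by 0.1 after 10 epochs."""
    params = [p for m in modules for p in m.parameters()]
    optimizer = torch.optim.Adam(params, lr=lr, betas=betas)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    return optimizer, scheduler

# Example with placeholder modules standing in for the actual networks.
optimizer, scheduler = build_optimizer([nn.Conv2d(3, 16, 3), nn.Conv2d(3, 16, 3)])
for epoch in range(20):
    # ... two-stage training pass over the data (optimizer.step() inside) ...
    scheduler.step()
```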

Performance metrics: In response to the challenge of not having access to real depth and motion trajectories in clinical data, we use the cross-correlation coefficient [41, 42] to quantitatively evaluate the model’s performance. This metric has been employed in medical image registration research to measure the similarity between images before and after registration. We adapt this metric for monocular endoscope 3D reconstruction. After inputting the original image s into the model, we obtain estimated depth and pose. Then, using depth and pose, we warp s to the target image t, resulting in a synthesized frame \(\hat{t}\). We then calculate the cross-correlation coefficient between t and \(\hat{t}\), and after normalization, a coefficient closer to 1 indicates greater similarity between the synthesized frame and the target frame, reflecting better model performance. The formula for the cross-correlation coefficient is as follows:

$$\begin{aligned} CC(s,t)=\sum _{p\in \Omega }\frac{\left( \sum _{p_i}(s(p_i)-\hat{s}(p))(t(p_i)-\hat{t}(p))\right) ^2}{\left( \sum _{p_i}(s(p_i)-\hat{s}(p))^2 \right) \left( \sum _{p_i}(t(p_i)-\hat{t}(p))^2 \right) } \end{aligned}$$
(13)

where p denotes a pixel on the image, and \(\hat{s}\) and \(\hat{t}\) are the synthetic frames obtained by warping the original image s and the target image t with the estimated depth and pose, respectively.
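A hedged sketch of this metric as a windowed (local) cross-correlation between the target frame t and the synthesized frame \(\hat{t}\); the window size and the final averaging over pixels and channels are our assumptions.

```python
import torch
import torch.nn.functional as F

def local_cross_correlation(t, t_hat, win=9):
    """Sketch of the normalized local cross-correlation of Eq. (13) over
    win x win windows; values near 1 indicate the synthesized frame closely
    matches the target. Inputs are (B, C, H, W) tensors."""
    B, C, H, W = t.shape
    kernel = torch.ones(C, 1, win, win, device=t.device, dtype=t.dtype)
    pad, n = win // 2, win * win

    def win_sum(x):
        return F.conv2d(x, kernel, padding=pad, groups=C)

    sum_t, sum_h = win_sum(t), win_sum(t_hat)
    sum_tt, sum_hh, sum_th = win_sum(t * t), win_sum(t_hat * t_hat), win_sum(t * t_hat)

    cross = sum_th - sum_t * sum_h / n
    var_t = sum_tt - sum_t * sum_t / n
    var_h = sum_hh - sum_h * sum_h / n

    cc = cross * cross / (var_t * var_h + 1e-5)
    return cc.mean()
```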

Table 1 The error and accuracy metrics for depth evaluation

Additionally, to validate the model's generalization, experiments were conducted on the SCARED dataset using the standard evaluation metrics listed in Table 1. In this table, \(d\) and \(d^*\) represent the predicted depth values and the corresponding ground-truth values, and D represents the set of predicted depth values. During validation, we use median scaling to scale the predicted depth values, as follows:

$$\begin{aligned} D_{scaled}=D_{pred}*(median(D_{gt})/median(D_{pred})) \end{aligned}$$
(14)

On the SCARED dataset, the predicted depth maps are scaled proportionally and capped at an upper limit of 150 mm.
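A minimal sketch of the median scaling of Eq. (14) together with the 150 mm cap; the valid-pixel mask is an assumption.

```python
import torch

def median_scale(d_pred, d_gt, max_depth=150.0):
    """Sketch of evaluation-time median scaling (Eq. 14) with a 150 mm cap."""
    mask = d_gt > 0                                  # valid ground-truth pixels (assumed convention)
    scale = torch.median(d_gt[mask]) / torch.median(d_pred[mask])
    return torch.clamp(d_pred * scale, max=max_depth)
```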

For pose estimation, we evaluate using the Absolute Trajectory Error (ATE) [43], as well as the mean and standard deviation of angle errors.

Quantitative evaluation of the cross-correlation coefficient

We evaluated the depth estimation accuracy of our framework against several typical self-supervised methods used for endoscopy, including EndoSLAM [17], Endo-Depth-and-Motion [18], and AF-SFM [39]. We split the clinical videos collected from hospital surgeries into frames, yielding 6370 endoscopic RGB images from 8 surgeries. Six of these surgeries, containing 4410 images, were used for training, and images from the remaining two surgeries were used to evaluate the training effect. Additionally, to demonstrate the generality of our model, we selected data from 4 scenes in the SCARED dataset and conducted the same tests. The experimental results on both datasets are shown in Table 2:

Table 2 Quantitative comparisons of correlation coefficient

In the evaluation based on the cross-correlation coefficient, values closer to 1 indicate a higher degree of similarity between the synthesized frames computed from estimated depth and pose and the target frames. This suggests better accuracy of the method. In the table above, entries 1 and 2 represent results from clinical data, while entries 3 to 6 represent results from the SCARED dataset. Comparing the results among different methods, except for the experiment at entry 5 where our method slightly underperformed compared to AF-SFM, our framework showed a significant advantage in the other test groups. This demonstrates that our framework can more accurately simulate camera motion within the thoracic cavity, enabling more precise 3D reconstruction.

Depth quantitative evaluation of conventional indicators

In addition to the comparison using the cross-correlation coefficient, we also conducted a quantitative evaluation on the SCARED dataset using conventional metrics. We directly validated the model trained on clinical data on the SCARED dataset without any fine-tuning. The experimental results are shown in Table 3:

Table 3 Quantitative comparison of depth on the SCARED dataset

From the table above, it is evident that our method achieved better results in various parameters, demonstrating its strong performance across different patients and endoscopes.

Fig. 3 Qualitative comparison results of different methods

Figure 3 provides a qualitative comparison. It can be observed that the three methods do not differ significantly on the SCARED dataset, with our method and AF-SFM showing slightly better results. However, on the clinical dataset, the Endo-Depth-and-Motion method is no longer capable of accurate depth estimation, while our method can effectively capture the depth values of two protruding regions.

Pose evaluation on the SCARED dataset

It can be seen that our framework slightly underperforms EndoSLAM only in the standard deviation; in the other metrics, our model outperforms the others. This may be because the EndoSLAM model uses an attention module in its pose estimation network, which allows the pose estimation network to handle regions with missing textures more effectively. We also attempted to incorporate this into our framework, but experimental results showed that the performance gain was not significant. This could be due to the complexity of the clinical environment: adding an attention module to the pose network in such a challenging setting may decrease accuracy rather than improve it (Table 4).

Table 4 Quantitative comparison of motion on the SCARED dataset

Figure 4 shows qualitative results for pose estimation. Panel (a) displays the predicted trajectory of the AF-SFM method, where the blue line represents the ground-truth trajectory and the green line the estimated trajectory. It is evident from the figure that our predictions are closer to the ground truth.

Fig. 4 Qualitative comparison of motion on the SCARED dataset

Ablation experiments

We divided our framework into four models: the baseline model (ID1) without an attention mechanism and without inter-layer loss, computing the loss only on the last layer; the attention model (ID2) with only the attention mechanism and no inter-layer loss; the inter-layer model (ID3) without an attention mechanism, using only inter-layer loss; and our full model (ID4). We compared these models on both the SCARED dataset and the clinical dataset. Table 5 shows the quantitative depth evaluation results on the SCARED dataset:

Table 5 Experimental study on the ablation of attention modules and inter-layer losses

The table clearly demonstrates that the proposed improvements indeed enhance the accuracy of depth prediction. After incorporating the attention module and inter-layer loss, all the experimental metrics improve, with \(\delta\) values approaching 1. This indicates that more predicted values fall within a factor of 1.25 of the true values. By comparing ID2 and ID3 against ID1, it is apparent that the inter-layer loss has a more significant impact on the entire framework. This may be because the inter-layer loss leverages features at different resolutions, providing the framework with a better understanding of both image details and the overall structure.

Table 6 presents six experiments conducted on the clinical dataset, using the cross-correlation coefficient as the evaluation metric. It is evident that on the clinical dataset, the different models (ID1–ID4) show a similar trend, leading to the same conclusions as on the SCARED dataset.

Table 6 Validation of ablation experiments on a clinical data set

We performed a quantitative evaluation of pose estimation under the same conditions; the results are shown in Table 7:

Table 7 Validation of pose estimation by ablation experiments on the SCARED dataset

In terms of pose, there is not much difference between ID1 and ID2, indicating that the attention module in the depth network does not have a positive effect on pose prediction. In contrast, the results of ID3 and ID4 are better than those of ID1 and ID2, with ID4 performing best, which also shows the superiority of our complete model for pose prediction.

Point clouds for SCARED and clinical datasets

After obtaining the depth estimation results and the endoscope ego-motion results, together with the camera intrinsic parameters, a 3D point cloud representation of the endoscopic scene can be derived. We use the Truncated Signed Distance Function (TSDF) approach of Recasens et al. [18] to represent and fuse the depth predictions into a high-quality surface reconstruction. Quantitative evaluation of point cloud reconstruction is challenging, but the reconstructions can be assessed qualitatively in Fig. 5. Panels (a) and (c) show results from the SCARED dataset, while (b) and (d) are from the clinical dataset. It is apparent that our framework can faithfully and accurately reconstruct the 3D structure of the endoscopic scenes in both environments.
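As an illustration, the following is a hedged sketch of fusing per-frame depth predictions into a TSDF volume using Open3D's ScalableTSDFVolume rather than the exact pipeline of Recasens et al. [18]; the voxel size, truncation distance, and depth units are placeholder values.

```python
import numpy as np
import open3d as o3d

def fuse_tsdf(colors, depths, poses, intrinsic, voxel=0.002, trunc=0.01):
    """Sketch: fuse predicted depth maps into a TSDF surface reconstruction.

    colors    : list of HxWx3 uint8 images
    depths    : list of HxW float32 depth maps (assumed in metres)
    poses     : list of 4x4 camera-to-world matrices from the pose network
    intrinsic : o3d.camera.PinholeCameraIntrinsic
    """
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=voxel, sdf_trunc=trunc,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

    for color, depth, pose in zip(colors, depths, poses):
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(color), o3d.geometry.Image(depth),
            depth_scale=1.0, depth_trunc=0.15, convert_rgb_to_intensity=False)
        # integrate() expects a world-to-camera extrinsic
        volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))

    return volume.extract_triangle_mesh()
```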

Fig. 5 Results of dense reconstruction of the SCARED dataset and clinical dataset

Conclusion

In this research, a novel self-supervised framework was designed. Leveraging an optical flow network as a base, we incorporated an attention module and introduced inter-layer loss into the network to address the challenges presented by endoscopic clinical datasets, such as severe inter-frame brightness fluctuations and significant scene variations. When faced with the difficulty of obtaining real depth and camera motion in clinical datasets, we used the cross-correlation coefficient as a quantitative evaluation metric. After assessing the performance using the cross-correlation coefficient, our framework exhibited superior mapping relationships between frames, which was attributed to the accuracy of depth estimation and endoscope ego-motion estimation. Finally, we conducted generalization experiments on the SCARED dataset, which also demonstrated the accuracy and generalization capabilities of our network.

Limitations and future work

Current research has primarily focused on static environments; for instance, datasets such as SCARED and SERV-CT are collected from porcine cadavers. In actual surgery, however, the intraoperative environment changes in real time. In the lung region especially, the lungs expand and contract periodically with the patient's breathing. Compared with previous research, our study increases scene complexity but does not address these dynamic aspects. Establishing dynamic lung models remains a relatively underdeveloped aspect of endoscopic 3D reconstruction. While some progress has been made in general environments, incorporating these achievements into endoscopic 3D reconstruction is a direction for our future research.