
1 Introduction

Specialists with extensive experience and a specific skill set are required to perform minimally invasive neurosurgeries. During these surgical procedures, differentiating between anatomical structures, maintaining orientation, and localizing oneself are extremely challenging. On the one hand, excellent knowledge of the specific anatomy as visualized by the image feedback of the surgical device is required. On the other hand, low contrast, non-rigid deformations, a lack of clear boundaries between anatomical structures, and disruptions such as bleeding make recognition occasionally very challenging even for experienced surgeons. Various techniques have been developed to help neurosurgeons orient themselves and perform surgery. Computer-assisted neuronavigation has been an important tool and research topic for more than a decade [8, 15], but it is still based on preoperative imaging, rendering it unreliable once the arachnoidal cisterns are opened and brain shift occurs [10]. More real-time anatomical guidance can be provided by intraoperative MRI [1, 22, 23] and ultrasound [2, 26]. Orientation has also been greatly enhanced by the application of fluorescent substances such as 5-aminolevulinic acid [7, 24]. Awake surgery [9] and electrophysiological neuromonitoring [3, 19] can also help navigate around essential brain tissue. These techniques work well and rely on physical properties other than light reflection. However, they are expensive to implement, require the operating surgeon to become fluent in a new imaging modality, and may require temporarily halting the surgery or retracting surgical instruments to obtain the intraoperative information [21].

Real-time anatomic recognition based on live image feedback from the surgical device has the potential to address these disadvantages and to act as a reliable tool for intraoperative orientation. This makes the application of machine vision algorithms appealing. The concepts of machine vision can likewise be employed in the neurosurgery operating room to analyze the digital images captured by the microscope or endoscope, automatically identify the visible anatomical structures, and map one's position onto a planned surgical path [21].

Deep learning applications within the operating room have become more prevalent in recent years. The applications include instrument and robotic tool detection and segmentation [20, 28], surgical skill assessment [4], surgical task analysis [13], and procedure automation [25]. Instrument and robotic tool detection and segmentation have been extensively researched for endoscopic procedures owing to the availability of various datasets and challenges [18]. Despite this research on endoscopic videos, the task of anatomic structure detection or segmentation, which could be a foundation for a new approach to neuronavigation, remains relatively unexplored and continues to be a challenge. Note that anatomy recognition in surgical videos is significantly more challenging than surgical tool detection because of the lack of clear boundaries and of differences in color or texture between anatomical structures.

The desire for a cheaper real-time solution that does not rely on additional machinery, together with the improvement of deep learning techniques, has also driven the development of vision-based localization methods. Approaches include structure from motion [11] and SLAM [6], e.g., [14, 16], which reconstruct 3D maps based on feature correspondences. Many vision-based localization methods rely on landmarks or on the challenging task of depth and pose estimation. The main idea behind these methods is to find distinctive landmark positions and follow them across frames for localization; their performance suffers from the low texture, lack of distinguishable features, non-rigid deformations, and disruptions present in endoscopic videos [16]. These methods have mostly been applied to diagnostic procedures, such as colonoscopy, rather than surgical procedures, which pose additional difficulties. Abrupt changes due to the surgical procedure, e.g., bleeding and removal of tissue, make tracking landmarks extremely challenging or even impossible. Therefore, an alternative solution is required to address these challenges.

In this study, a live image-only deep learning approach is proposed to provide guidance during endoscopic neurosurgical procedures. The approach relies on the detection of anatomical structures from RGB images in the form of bounding boxes, instead of arbitrary landmarks as in other approaches [5, 16], which are difficult to identify and track in the abruptly changing environment during a surgery. The bounding box detections are then used to map a sequence of video frames onto a one-dimensional trajectory that represents the surgical path. This allows for localization along the surgical path and, therefore, for predicting anatomical structures in the forward or backward direction. The surgical path is learned in an unsupervised manner from a training set of videos using an autoencoder architecture. Thus, instead of reconstructing a 3D environment and localizing based on landmarks, we rely on a common surgical roadmap and localize ourselves within that map using bounding box detections.

The learned mapping rests on the principle that the visible anatomical structures and their relative sizes are strongly correlated with the position along the surgical trajectory. To this end, bounding box detections capture the presence of structures, their sizes, also relative to each other, and their constellations. A simplified representation is shown in Fig. 1. Using bounding box detections of anatomical structures as semantic features mitigates the problem of varying appearance across different patients, since the bounding box composition is less likely to change across patients than the appearance of the anatomy in RGB images. Furthermore, because each of the considered anatomical structures has only one instance in every patient, we do not need to rely on tracking arbitrary structures, e.g., areas with a unique appearance compared to their surroundings, which further facilitates dealing with disruptions during surgery, such as bleeding or flushing. We applied the proposed approach to the transsphenoidal adenomectomy procedure, where the surgical path is relatively one-dimensional, as shown in Fig. 2, which makes it well-suited for a proof of concept of the suggested method.

Fig. 1.

Simplified representation of the suggested approach. 1. A sequence of input images is processed to detect bounding boxes of anatomical structures. 2. A neural network encodes the sequence of detections into a latent variable that correlates with the position along the surgical path. 3. Given the current position along the surgical path, an estimate of the anatomical structures in the forward or backward direction can be obtained by extrapolating the current value of the latent variable.

2 Methods

2.1 Problem Formulation and Approach

Let \(\textbf{S}_{t}\) denote an image sequence that consists of endoscopic frames \(\textbf{x}_{t-s:t}\), such as the one shown in Fig. 2, where s represents the sequence length in terms of the number of frames, and \(\textbf{x}_t \in \mathbb {R}^{w \times h \times c}\) is the t-th frame with w, h, and c denoting the width, height, and number of channels, respectively. Our main aim is to embed the sequence \(\textbf{S}_{t}\) in a 1D latent space represented by the variable \(\textbf{z}\). This 1D latent space represents the surgical path taken from the beginning of the procedure until the final desired anatomy is reached. Our approach is to determine the anatomical structures visible in the sequence \(\textbf{S}_t\) along the surgical path and to map the frame \(\textbf{x}_t\) to the latent space, where the latent space effectively acts as an implicit anatomical atlas. We refer to this as an implicit atlas because the position information along the surgical path is not available for the construction of the latent space. To achieve this, we perform object detection on all frames \(\textbf{x}_{t-s:t}\) in \(\textbf{S}_{t}\) and obtain a sequence of detections \(\textbf{c}_{t-s:t}\), which we denote as \(\textbf{C}_t\). A detection \(\textbf{c}_t \in \mathbb {R}^{n \times 5}\) represents the anatomical structures and bounding boxes of the t-th frame, where n denotes the number of different classes in the surgery. More specifically, \(\textbf{c}_t\) consists of a binary variable \(\textbf{y}_t = [y_t^1,\dots ,y_t^n] \in \{0,1\}^{n}\) denoting the structures (or classes) present in the t-th frame and \(\textbf{b}_t = [\textbf{b}_t^1,\dots ,\textbf{b}_t^n]^T \in \mathbb {R}^{n \times 4}\) denoting the respective bounding box coordinates. An autoencoder architecture is used to achieve the embedding, i.e., to map \(\textbf{C}_t\) to \(\textbf{z}_t\). The encoder maps \(\textbf{C}_t\) to \(\textbf{z}_{t}\), and the decoder generates \(\hat{\textbf{c}}_t\), the detections of the last frame in a given sequence, from \(\textbf{z}_{t}\). The model parameters are updated so that \(\hat{\textbf{c}}_t\) fits \(\textbf{c}_{t}\) on a training set, as explained in the following.
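For concreteness, the following minimal sketch shows one way the detection \(\textbf{c}_t\) and the sequence \(\textbf{C}_t\) could be assembled as arrays. The normalized (x, y, w, h) box format, the front-padding near the start of a video, and treating s as the total window length are assumptions not specified above.

```python
import numpy as np

def make_frame_detection(present, boxes, n_classes=15):
    """Build c_t in R^{n x 5}: column 0 holds the binary presence y_t^i and
    columns 1-4 hold the bounding box b_t^i (zeros when the class is absent)."""
    c = np.zeros((n_classes, 5), dtype=np.float32)
    c[:, 0] = present                        # y_t in {0, 1}^n
    c[present.astype(bool), 1:] = boxes      # assumed normalized (x, y, w, h), one row per present class
    return c

def make_sequence(per_frame_detections, t, s=64):
    """Stack C_t from the s most recent detections up to frame t; near the start of a
    video the earliest frame is repeated to fill the window (a padding assumption)."""
    start = max(0, t - s + 1)
    window = list(per_frame_detections[start:t + 1])
    window = [window[0]] * (s - len(window)) + window
    return np.stack(window, axis=0)          # shape (s, n, 5)
```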

Fig. 2.

Left: The transsphenoidal adenomectomy procedure is performed to remove a tumor from the pituitary gland, located at the base of the brain. Using an endoscope and various instruments, the surgeon enters through the nostril and crosses the sphenoidal sinus to access the pituitary gland located behind the sellar floor. All procedures in the dataset accessed only one nostril instead of two. Right: A video frame showing only the anatomy. Note the lack of clear differences between anatomical structures in such images.

2.2 Object Detection

Our approach requires detecting anatomical structures as bounding boxes in video frames. To this end, the object detection part of the pipeline is fulfilled by an iteration of the YOLO network [17]; specifically, the YOLOv7 network was used [27]. The network was trained on the endoscopic videos in the training set, whose frames are sparsely labeled with bounding boxes covering 15 different anatomical classes and one surgical instrument class. The trained network was then applied to every frame of the training videos to create detections of these classes, which are then used to train the subsequent autoencoder that models the embedding.
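How duplicate or overlapping detections of the same class are resolved is not described above; the sketch below assumes a generic, hypothetical detector interface returning (class_id, confidence, box) tuples per frame and keeps only the highest-confidence box per class, which is consistent with each anatomical structure having at most one instance per frame. The default confidence threshold of 0.25 follows Sect. 3.2.

```python
import numpy as np

N_DETECTOR_CLASSES = 16  # 15 anatomical structure classes + 1 instrument class

def detections_to_c(raw_detections, conf_threshold=0.25):
    """Convert the raw detector output of one frame into the (n x 5) representation.

    raw_detections: iterable of (class_id, confidence, (x, y, w, h)) tuples; this is
    a hypothetical interface standing in for the YOLOv7 inference code. Only the
    highest-confidence detection per class is kept.
    """
    c = np.zeros((N_DETECTOR_CLASSES, 5), dtype=np.float32)
    best_conf = np.zeros(N_DETECTOR_CLASSES, dtype=np.float32)
    for class_id, conf, box in raw_detections:
        if conf < conf_threshold or conf <= best_conf[class_id]:
            continue
        best_conf[class_id] = conf
        c[class_id, 0] = 1.0                              # presence flag y^i
        c[class_id, 1:] = np.asarray(box, dtype=np.float32)
    return c
```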

Fig. 3.

The model architecture. The model consists of an encoder and two decoders. The encoder consists of a multi-head attention layer, i.e., a transformer encoder, which takes \(\textbf{C}_t\) as input, followed by a series of fully connected layers to embed the input in a 1D latent dimension. The two decoders consist of fully connected layers to generate the class probabilities \(\hat{\textbf{y}}_t\) and the bounding box coordinates \(\hat{\textbf{b}}_t\), respectively.

2.3 Embedding

To encode the bounding boxes onto the 1D latent space, the output of the YOLO network was slightly modified to exclude the surgical instrument, because the presence of the instrument in a frame is not necessarily correlated with the frame's position along the surgical path. The autoencoder was designed to reconstruct only the last detection \(\textbf{c}_{t}\) in \(\textbf{C}_t\), because \(\textbf{z}_{t}\) is meant to correspond to the current position. However, it takes the s previous frames into account to provide more information when determining the latent representation \(\textbf{z}_t\) of a frame \(\textbf{x}_t\).

The encoder of the autoencoder network consists of multi-head attention layers followed by fully connected layers, which eventually reduce the features to a 1D value. A transformer-based encoder is used here to encode the temporal information in the sequence of detections. The decoder consists of two fully connected decoders, the first of which generates the class probabilities \(\hat{\textbf{y}}_t\) of \(\hat{\textbf{c}}_t\) and the second of which generates the corresponding bounding boxes \(\hat{\textbf{b}}_t\). A simplified representation of the network is shown in Fig. 3. The loss function consists of a classification loss and a bounding box loss, the latter of which is only calculated for the classes present in the ground truth. This results in the following objective to minimize for the t-th frame of the m-th training video:

$$\begin{aligned} \mathcal {L}_{m,t}=-\sum _{i=1}^n\left( y_{m,t}^i \log \left( \hat{y}_{m,t}^i\right) +\left( 1-y_{m,t}^i\right) \log \left( 1-\hat{y}_{m,t}^i\right) \right) +\sum _{i=1}^n y_{m,t}^i\left| \textbf{b}_{m,t}^i-\hat{\textbf{b}}_{m,t}^i\right| , \end{aligned}$$

where \(|\cdot |\) is the \(l_1\) loss, and \(\hat{y}^i_{m,t}\) and \(\hat{\textbf{b}}^i_{m,t}\) are generated from \(\textbf{z}_{m,t}\) using the autoencoder. The total training loss is then obtained by summing \(\mathcal {L}_{m,t}\) over all frames and training videos. The proposed loss function can be considered to correspond to maximizing the joint likelihood of a given \(\textbf{y}\) and \(\textbf{b}\) under a probabilistic model that uses a mixture model for the bounding boxes.
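A minimal PyTorch sketch of this objective follows; averaging over a mini-batch is an implementation choice rather than part of the formula above.

```python
import torch

def embedding_loss(y_true, b_true, y_pred, b_pred, eps=1e-7):
    """L_{m,t}: binary cross-entropy over the classes plus an l1 bounding box term
    that is only evaluated for classes present in the ground truth.

    y_true, y_pred: (batch, n) presence labels and predicted probabilities
    b_true, b_pred: (batch, n, 4) bounding box coordinates
    """
    y_pred = y_pred.clamp(eps, 1.0 - eps)                    # numerical stability
    cls_loss = -(y_true * torch.log(y_pred)
                 + (1.0 - y_true) * torch.log(1.0 - y_pred)).sum(dim=1)
    box_loss = (y_true.unsqueeze(-1) * (b_true - b_pred).abs()).sum(dim=(1, 2))
    return (cls_loss + box_loss).mean()                      # average over the batch
```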

3 Experiments and Results

3.1 Dataset

The object detection dataset consists of 166 anonymized videos recorded during transsphenoidal adenomectomy procedures in 166 patients. The videos were recorded using various endoscopes at multiple facilities and were made available through general research consent. The videos were labeled by neurosurgeons and include 16 different classes, that is, 15 anatomical structure classes and one surgical instrument class. The dataset contains approximately 19000 labeled frames out of around \(3\times 10^6\) frames in total. Owing to the anatomical nature of the human body, every class has only one instance per video, except for the instrument class, because various instruments are used during the procedures. Of the 166 videos, 146 were used for training and validation, and 20 for testing. While the data come from different centers, we acknowledge that all centers are concentrated in one geographic location, which may induce biases in our algorithms. However, we also note that different endoscopes were used and that the videos were acquired over the last 10 years.

3.2 Implementation Details

The implementation of the YOLO network follows [27] using an input resolution of \(1280 \times 1280\). The model reached convergence after 125 epochs. To generate the data to train the autoencoder, the object confidence score and intersection-over-union (IoU) threshold were set to 0.25 and 0.45, respectively.

The autoencoder uses a transformer encoder that consists of six transformer encoder layers with five heads and an input size of \(s \times 15 \times 5\), where s is set to 64 frames. Subsequently, the dimension of the output of the transformer encoder is reduced by three fully connected layers to 512, 256, and 128, with rectified linear unit (ReLU) activation functions in between. Finally, a last fully connected layer reduces the dimension to 1D and uses a sigmoid activation function to obtain the final latent variable. The two decoders, the class decoder and the bounding box decoder, each consist of two fully connected layers, increasing the dimension of the latent variable from 1 to 8 and then 15 for the class decoder, and from 1 to 32 and then \(15 \times 4\) for the bounding box decoder. The first layer of both decoders is followed by a ReLU activation function and the final layer by a sigmoid activation function.
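A PyTorch sketch following these layer sizes is shown below; flattening the transformer output across time before the fully connected layers and the exact placement of the activations are our assumptions.

```python
import torch
import torch.nn as nn

class SurgicalPathAutoencoder(nn.Module):
    """Sketch of the architecture described above (layer sizes from the text;
    flattening the transformer output before the fully connected layers is assumed)."""

    def __init__(self, seq_len=64, n_classes=15, feat_dim=5, n_layers=6, n_heads=5):
        super().__init__()
        d_model = n_classes * feat_dim                 # 75 features per frame
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_latent = nn.Sequential(                # s*75 -> 512 -> 256 -> 128 -> 1
            nn.Linear(seq_len * d_model, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )
        self.class_decoder = nn.Sequential(            # 1 -> 8 -> 15
            nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, n_classes), nn.Sigmoid())
        self.box_decoder = nn.Sequential(              # 1 -> 32 -> 15*4
            nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, n_classes * 4), nn.Sigmoid())
        self.n_classes = n_classes

    def forward(self, seq):                            # seq: (batch, s, n, 5)
        tokens = seq.flatten(2)                        # (batch, s, 75)
        enc = self.transformer(tokens)                 # (batch, s, 75)
        z = self.to_latent(enc.flatten(1))             # (batch, 1), values in [0, 1]
        y_hat = self.class_decoder(z)                  # (batch, 15)
        b_hat = self.box_decoder(z).view(-1, self.n_classes, 4)
        return z, y_hat, b_hat
```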

For training of the autoencoder, the AdamW optimizer [12] was used in combination with a warm-up scheduler that linearly increases the learning rate from 0 to \(1\times 10^{-4}\) over 60 epochs. The model was trained for 170 epochs.
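A possible realization of this schedule with PyTorch, reusing the model and loss sketches above; keeping the learning rate constant after the warm-up is an assumption, as the post-warm-up behavior is not specified.

```python
import torch

# Hypothetical training setup mirroring the description above: AdamW with a learning
# rate warmed up linearly from 0 to 1e-4 over the first 60 epochs, then held constant.
model = SurgicalPathAutoencoder()                      # sketch defined above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / 60))

for epoch in range(170):
    # ... iterate over detection sequences, compute embedding_loss, and step the optimizer ...
    scheduler.step()
```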

3.3 Results

Anatomical Structure Detection: The performance of the YOLO network on the test videos is shown in Table 1, using an IoU threshold for non-maximum suppression of 0.45 and an object confidence threshold of 0.001. The latter is set to 0.001 as this is the threshold commonly used in other detection works. The YOLO model performs surprisingly well on the challenging problem of detecting anatomical structures in endoscopic neurosurgical videos.

Table 1. YOLO detection model results on 20 test videos with an IoU threshold for non-maximum suppression of 0.45 and an object confidence threshold of 0.001.

Qualitative Assessment of the Embedding: First, to evaluate the learned latent representation, we compute the confidences for every class, i.e., \(\hat{y}^i\), at different points of the latent space and plot them in Fig. 4. The confidences are normalized for every class, where the maximum confidence of a class corresponds to the darkest shade of blue and the minimum to the lightest. This shows how likely it is to find an anatomical structure at a certain location in the latent space, resembling a confidence interval for the structure's presence along the surgical path.
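A sketch of how such a confidence map can be produced with the class decoder of the autoencoder sketched in Sect. 3.2; the per-class min-max normalization and the number of sampled points are assumptions.

```python
import torch

@torch.no_grad()
def class_confidence_map(model, n_points=200):
    """Decode the class confidences at evenly spaced points of the latent space and
    normalize each class to [0, 1] across those points, as visualized in Fig. 4."""
    z = torch.linspace(0.0, 1.0, n_points).unsqueeze(1)    # (n_points, 1) latent positions
    conf = model.class_decoder(z)                          # (n_points, 15) decoded confidences
    conf_min = conf.min(dim=0).values
    conf_max = conf.max(dim=0).values
    return (conf - conf_min) / (conf_max - conf_min + 1e-8)
```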

Fig. 4.

The normalized generated confidences of each class along the latent space. This visualizes the probability of finding a certain anatomical structure at a specific point in the latent space. Additionally, the video frames of the twenty test videos responsible for the first appearance of each anatomical structure have been encoded and overlaid onto the confidence intervals to demonstrate that their locations correlate with the beginnings of these intervals.

Figure 4 shows how the autoencoder encodes and separates anatomical structures along the surgical path. For example, going from left (\(z=0\), the start of the surgical path) to right (\(z=1\), the end of the surgical path), it can be seen that the septum is frequently visible at the start of the surgical path, but later it is no longer visible. Because a sequence encodes to a single point in the latent space, positioning along the surgical path is possible, which allows forecasting structures in both the forward and backward directions.
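The following sketch illustrates this forecasting use, reusing the autoencoder sketch from Sect. 3.2; the step size delta_z and the confidence cut-off are illustrative assumptions, not values from this work.

```python
import torch

@torch.no_grad()
def forecast_structures(model, seq, delta_z=0.05, threshold=0.5):
    """Encode the current sequence C_t, shift the latent position by delta_z (positive
    for forward, negative for backward along the surgical path), and decode the classes
    expected at that position."""
    z, _, _ = model(seq)                                   # current position on the path
    z_next = (z + delta_z).clamp(0.0, 1.0)
    y_hat = model.class_decoder(z_next)                    # expected class confidences
    b_hat = model.box_decoder(z_next).view(-1, model.n_classes, 4)
    expected_classes = (y_hat > threshold).nonzero(as_tuple=True)[1]
    return z_next, expected_classes, b_hat
```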

Furthermore, twenty test videos were used to validate the spatial embedding of the anatomical structures. For every video, the frame of the first appearance of each anatomical structure was noted. To obtain the corresponding z-value, a sequence was created for each noted frame using the s previous frames. These sequences were then embedded into the 1D latent dimension to determine whether their locations corresponded to the beginning of the confidence intervals in the latent space where the corresponding anatomical structures were expected to start appearing. Examining these encodings, it is evident that the points are located on the left side of the confidence interval for every class. Considering a path from \(z=0\) to \(z=1\), this demonstrates that the autoencoder accurately maps the first appearances of every class in the validation videos to the beginning of the corresponding confidence intervals, showing that the network is capable of relative positional embedding.

Figure 5 plots the z-value against time (t) over an entire surgical video, where \(t=1\) denotes the end of the video. The plot on the right shows how the endoscope is frequently extracted, going from a certain z-value back to 0, the beginning of the surgical path, almost instantaneously. Retraction and reinsertion of the endoscope are common in this surgery, and the plot reflects this. Subsequently, the endoscope is swiftly reinserted to the region of interest, with little time spent at z-values lower than those visited before the extraction. Additionally, there are locations along the latent space where more time is spent than at others, such as around \(z=0.2\) and \(z=0.6\), which correspond to locations where tissue is removed or passageways are created, such as the opening of the sphenoidal sinus. We also note that the z-value shoots to \(z=1\) at certain times. Z-values from 0.5 to 1.0 actually correspond to a narrow section of the surgical path. However, this narrow section is the crux of the surgery, where the surgeon spends more time. Hence, this behavior of the model is expected, since more time spent in this section leads to a higher number of images and, ultimately, to a larger share of the latent space.

Fig. 5.

Z-values over time during a surgical video. Certain z-values are encoded more frequently than others, such as approximately \(z=0.2\) and \(z=0.6\), which is related to the amount of time spent at a certain location during the surgery.

Lastly, we used the decoder to generate bounding boxes while moving along the 1D latent dimension from one end to the other. A GIF showing the bounding boxes of the classes that the model expects to encounter along the surgical path can be found at the following link: https://gifyu.com/image/SkMIX. Different classes are expected at different points, and their locations and sizes vary along the latent space. From the appearance to the disappearance of a class, its bounding box grows larger and its center either moves from the center of the frame toward the outer region or stays centered. This behavior is expected when moving through a tunnel structure, as in an endoscopic procedure. This shows that the latent space can learn a roadmap of the surgery with the expected anatomical structures at any location on the surgical path.

Quantitative Assessment of the Embedding: Beyond the qualitative analyses, we performed a quantitative analysis to demonstrate that the latent space spatially embeds the surgical path. As there are no ground truth labels for the position of a video frame on the surgical path, a direct quantitative comparison of the z-value to the ground truth position is not possible. To provide a quantitative evaluation, we observe that if the latent space represents the surgical path spatially, frames that encode to z-values at the beginning of the path should be encountered in the early stages of the surgery, and vice versa. Therefore, the timestamp t of the sequence responsible for the first encoding of a specific z-value should increase with increasing z-value. This is confirmed by the mean correlation coefficient between t and \(\textbf{z}\) over the 20 test videos, which is 0.80. Figure 6 shows the relation between t and \(\textbf{z}\) for five test videos with their corresponding Pearson correlation coefficients r for an untrained and a trained model.
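A sketch of how such a per-video correlation could be computed; discretizing the z-values into bins to define "a specific z-value" is our assumption, as the exact procedure is not specified above.

```python
import numpy as np

def first_encoding_correlation(z_values, times, n_bins=100):
    """Pearson correlation between the normalized time t at which a z-value is first
    encoded and the z-value itself, for a single video.

    z_values, times: per-frame numpy arrays; times are normalized so that t=1 is the
    end of the video.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(z_values, bins) - 1, 0, n_bins - 1)
    first_t, first_z = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            first_t.append(times[mask].min())     # earliest time this z-bin is reached
            first_z.append(bins[b])
    return np.corrcoef(first_z, first_t)[0, 1]    # Pearson r
```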

Fig. 6.

Latent variable plotted against time of its first encoding for 5 surgical videos. Pearson correlation coefficients for first-time of appearance and \(\textbf{z}\) values are given for an untrained (left) and trained (right) model. In these plots, \(t=1\) denotes the end of a video. The untrained model provides a baseline for expected correlation coefficients. High correlation coefficients suggest the embedding captures the relative position on the surgical path.

4 Conclusion

In this study, we propose a novel approach to neuronavigation based on deep learning. The suggested approach is live image-based and uses bounding box detections of anatomical structures to localize itself on a common surgical roadmap that is learned from a dataset containing numerous videos of a specific surgical procedure. The mapping is modeled with an autoencoder architecture and trained without supervision. The method allows for the localization and forecasting of anatomical structures to be encountered in the forward and backward directions along the surgical path, similar to a mapping application.

The presented work also has some limitations. The main limitation is that we focused on only one surgery in this initial work; extension to other surgeries is a topic of our future research. The proposed method can also be combined with SLAM approaches as well as with guidance provided by MRI; both of these directions also form part of our future work. Another limitation is that the latent dimension only provides a relative positional encoding. Going beyond this may require additional labels of the true position on the surgical path.