The numerous advantages of minimally invasive surgery, such as shorter recovery time, less pain and blood loss, and better cosmetic results, make it the preferred choice over conventional open surgery [1]. In laparoscopy, the surgical instruments are inserted through small incisions in the abdominal wall and the procedure is monitored using a laparoscope. The special way of manipulating the surgical instruments and the indirect observation of the surgical scene make laparoscopic procedures more challenging to perform [2]. The complexity of laparoscopy requires special training and assessment for surgery residents to gain the required bi-manual dexterity. Analyzing the streaming videos during the surgery and the recorded videos from previously completed procedures can potentially improve outcomes. The tedium and cost of such an analysis can be dramatically reduced by an automated tool detection system, which is therefore the focus of this paper.

Tracking surgical tools is essential for understanding the workflow of a procedure and for the assessment and rating of the videos. For example, it has been shown that experts have a better economy of motion compared to novice or less experienced surgeons [3, 4]. Also, by detecting the tools, we can check for wrong tool usage, monitor the activation time of electro-surgical tools, and verify the use of proper technique (e.g., how a needle is positioned and moved with a needle driver during suturing).

Manual annotation of long surgical videos is a time-consuming and expensive task. A vision-based algorithm for automated detection of the presence, location, or movement of surgical tools is indispensable in designing a fast and objective surgical evaluation system. A well-annotated database of surgical videos can also be used for information retrieval and is a reliable source for the education and training of future surgeons.

During surgery, monitoring the usage of surgical tools can provide real-time feedback to the surgeons and operating room staff. Furthermore, in computer-aided intervention, the surgical tools are controlled by a surgeon with the aid of a specially designed robot [5], which requires a real-time understanding of the current task. Therefore, detecting the presence, location, or pose of the surgical instruments is useful in robotic surgeries as well [6,7,8]. Finally, an automated tool usage detector can help to generate an operative summary.

Several approaches have been introduced to track surgical instruments using signals collected during the procedure. For instance, in vision-based methods, the instruments are localized using the videos captured during the operation. These methods are generally reliable and inexpensive. Traditional vision-based methods rely on extracted features such as shape, color, and the histogram of oriented gradients, along with a classification or regression method to estimate the presence, location, or pose of the instrument in the captured images or videos. However, these methods depend on pre-defined, painstakingly engineered hand-crafted features; defining and extracting such features is itself a major part of the detection pipeline, which makes these designs poorly suited for real-time applications.

Compared with the other surgical video tasks, detecting the presence and usage of surgical instruments in laparoscopic videos has certain challenges that need to be considered.

First, since multiple instruments might be present at the same time, detecting the presence of these tools in a video frame is a multilabel (ML) classification problem. In general, ML classification is more challenging than the well-studied multiclass (MC) problem, where every instance is associated with only one output. The challenges include, but are not limited to, exploiting the correlation and co-occurrence of different objects/concepts with each other and with the background/context, and handling the variation in how often different objects occur.

Second, as opposed to other surgical videos, such as cataract surgery, robot-assisted surgery, or videos from a simulation, where the camera is stationary or moving smoothly, in laparoscopic videos, the camera is constantly shaking. Due to the rapid movement and changes in the field of view of the camera, most of the images suffer from motion blur, and the objects can be seen in various sizes and locations. Also, the camera view might be blocked by the smoke caused by burning tissue during cutting or cauterizing to arrest bleeding. Therefore, using still images is not sufficient for detecting the instruments.

Third, surgical operations follow a specific order of tasks. Although the usage of the tools does not strictly adhere to that order, it is nevertheless highly correlated with the task being performed. The performance of the tool detection can be improved with the information about the task and the relative position of the frame with regard to the entire video.

Finally, since the performance of a deep classifier trained with supervised learning is highly dependent on the size and quality of the labeled dataset, collecting and annotating a large dataset is a crucial task.

Recent years have witnessed great advances in deep-learning techniques in various computer vision areas such as image classification, object detection, and segmentation, as well as in medical imaging [9]. Therefore, there is a trend towards using these methods to analyze videos taken from laparoscopic operations.

EndoNet [10] was the first deep-learning model designed for detecting the presence of surgical instruments in laparoscopic videos; it used AlexNet [11] as a Convolutional Neural Network (CNN) for feature extraction and was trained for the simultaneous detection of surgical phases and instruments. Inspired by this work, other researchers used different CNN architectures [12, 13] to classify the frames based on the visual features. For example, in [14], three CNN architectures were used, and [15] proposed an ensemble of two deep CNNs.

Sahu et al. [16] were the first to address class imbalance in the ML classification of video frames. They balanced the training set according to the combinations of the instruments: the data were re-sampled to have a uniform distribution in label-set space, and class re-weighting was used to balance the data. Despite the improvement gained by considering co-occurrence when balancing the training set, the correlation of tool usage was not considered directly in the classifier, and the decision was made solely based on the presence of individual tools. Alshirbaji et al. [17] used class weights and re-sampling together to deal with the imbalance issue.

To consider the temporal features of the videos, Twinanda et al. employed a hidden Markov model (HMM) in [10] and a Recurrent Neural Network (RNN) in [18]. Sahu et al. utilized a Gaussian distribution fitting method in [12] and a temporal smoothing method based on a moving average in [16] to improve the classification results after the CNN was trained. Mishra et al. [19] were the first to apply a Long Short-Term Memory model (LSTM) [20], as an RNN, to a short sequence of frames to simultaneously extract spatial and temporal features for detecting the presence of the tools through end-to-end training.

A variety of other approaches have also been proposed. Hu et al. [21] proposed an attention-guided method using two deep CNNs to extract local and global spatial features. In [22], a boosting mechanism was employed to combine different CNNs and RNNs. In [23], the tools were localized after labeling the dataset with bounding boxes containing the surgical tools.

It should be noted that none of the previous methods takes advantage of knowledge about the order of the tasks, and the correlations among the tools are not directly utilized in identifying different surgical instruments. In this paper, we propose a novel context-aware model called LapTool-Net to detect the presence of surgical instruments in laparoscopic videos. The uniqueness of our approach is based on the following three original ideas:

  • A novel ML classifier is proposed as a part of LapTool-Net, to take advantage of the co-occurrence of different tools in each frame—in other words, the context is taken into account in the detection process.

  • The ML classifier and the decision model are trained in an end-to-end fashion.

  • The model's predictions for each video are sent to another RNN to consider the order of usage of different tools/tool combinations and long-term temporal dependencies, which is yet another consideration of the context.

The pre-print version of this paper with more results and detailed discussions can be found in [24]. The preliminary results were presented at the SAGES 2017 Annual Meeting.

Materials and methods

The overview of the proposed model is illustrated in Fig. 1. The goal is to design a classifier that maps the frames of surgical videos to the tools present in the observed scene. The overall system is described based on the dataset from the M2CAI16 tool detection challenge, which is a subset of the Cholec80 dataset [10]. We chose the smaller dataset to highlight the improvements brought by the main contributions of this paper. The dataset contains 15 videos of cholecystectomy, which is the surgery for removing the gallbladder. All the videos are labeled with the presence of seven tools at one frame out of every 25 (i.e., at 1 fps). The tools are Bipolar, Clipper, Grasper, Hook, Irrigator, Scissors, and Specimen bag. There are ten videos for training and five videos for validation. The type and shape of all seven tools remain the same across the training and validation sets.

Fig. 1 Block diagram of the proposed classifier for detecting the presence of surgical tools in each frame of a laparoscopic video

Since the publicly available Cholec80 dataset was used in this study to train and test our deep-learning model, an Institutional Review Board (IRB) approval is not required for this study.

Spatio-temporal features

To detect the presence of surgical instruments in laparoscopic videos, the visual features (intra-frame spatial and inter-frame temporal features) need to be extracted. We use a CNN to extract the spatial features. A CNN is a type of artificial neural network that is capable of processing still images and has been successfully applied to many computer vision tasks involving image classification or object recognition. As shown in Fig. 1, the input frame \({x}_{ij}\) is sent through the trained CNN, and the output of the last convolutional layer (after pooling) forms a fixed-size spatial feature vector \({v}_{ij}\).

Since there is a high correlation among video frames, it can be exploited by an RNN to improve the performance of the tool detection algorithm. An RNN uses its internal memory (states) to process a sequence of inputs in time-series and video-processing tasks [25]. This helps the model to identify the tools even when they are occluded or unclear due to motion blur. For this purpose, short sequences of frames (say 5 frames) are selected. We call the model consisting of a CNN and an RNN a Recurrent Convolutional Neural Network (RCNN).

For each frame \({x}_{ij}\), the sequence of spatial features ending at that frame is the input to the RNN. The total length of the input is no longer than one second, which ensures that the tools remain visible during that interval. We selected the Gated Recurrent Unit (GRU) [26] as our RNN for its simplicity. The final hidden state \({h}_{ij}\) is the output of the GRU and is the input to a fully connected neural network, FC1.
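To make the data flow concrete, the following is a minimal, illustrative sketch of such an RCNN in TensorFlow/Keras; the layer names, the GRU width, the input resolution, and the use of InceptionV3 as a stand-in backbone (Keras does not ship Inception-V1) are our assumptions rather than the exact implementation.

```python
import tensorflow as tf

NUM_TOOLS = 7   # K individual tools
SEQ_LEN = 5     # current frame plus the 4 previous frames

# Stand-in CNN backbone producing the per-frame spatial features v_ij.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, pooling="avg", input_shape=(224, 224, 3))

frames = tf.keras.Input(shape=(SEQ_LEN, 224, 224, 3))      # sequence ending at x_ij
feats = tf.keras.layers.TimeDistributed(backbone)(frames)  # per-frame features v_ij
h = tf.keras.layers.GRU(256)(feats)                        # final hidden state h_ij
scores = tf.keras.layers.Dense(NUM_TOOLS, activation="sigmoid", name="fc1")(h)

rcnn = tf.keras.Model(frames, scores)                      # multilabel classifier f
```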

Tool combination

In a laparoscopic cholecystectomy, not all \({2}^{K}\) combinations of the \(K\) tools are possible, since the total number of incisions is typically 3 or 4. Figure 2 shows the percentage of the most likely combinations in the M2CAI dataset. The first 15 classes, out of a possible maximum of 128, span more than 99.5% of the frames in both the training and the validation sets, and the tool combinations have almost the same distribution in both cases. Extracting the pattern in the surgical tool combinations can potentially improve the performance of an automated tool detection algorithm. Furthermore, modeling the tools' co-occurrence is beneficial for assessing performance by monitoring wrong combinations.

Fig. 2 The distribution for the combination of the tools in the M2CAI dataset

To consider the tool combinations, the well-known Label Power-set (LP) method combines multiple tools into one superclass (combination) and transforms the problem into a multiclass classification. The advantage of LP is that the class dependencies are automatically considered. Also, by eliminating uncommon combinations from the outputs, the classifier's attention is directed towards the more probable combinations.

Since an LP classifier is MC, training a deep-learning model with a Softmax loss requires the classes to be mutually exclusive. In other words, each superclass is treated as a separate class, i.e., separate features activate each superclass. This degrades the classifier's performance and, therefore, more data are required for training. We address this issue through a novel use of LP as the decision model g, which we apply on top of the ML classifier f. The decision model is a fully connected neural network (FC2) that takes the confidence scores of f and maps them to the corresponding superclass (Fig. 1). Our method helps the classifier to treat the superclasses as combinations of classes rather than as separate, mutually exclusive classes.
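As a rough illustration of this decision model, the sketch below maps the ML confidence scores to one of the retained superclasses; the layer name and the decoding via a lookup table of retained combinations are hypothetical details, not the exact implementation.

```python
import tensorflow as tf

NUM_TOOLS = 7
NUM_SUPERCLASSES = 15   # retained tool combinations (M2CAI)

# Decision model g: FC2 maps the confidence scores P of f to a superclass.
p = tf.keras.Input(shape=(NUM_TOOLS,))   # sigmoid outputs of the ML classifier f
q = tf.keras.layers.Dense(NUM_SUPERCLASSES, activation="softmax", name="fc2")(p)
decision = tf.keras.Model(p, q)

# At inference, the predicted superclass index is decoded back to a set of tools
# through a lookup table of the retained combinations, e.g. combo_table[argmax(q)].
```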

Class imbalance

In laparoscopic surgery, some tools are used more often than others. For instance, in our dataset, the Grasper is present in almost 80% of the procedure, whereas the Scissors are visible for less than five seconds in each video. It is known that in skewed datasets, the classifier's decision is inclined towards the majority classes. Therefore, it is always beneficial to have a uniform class distribution during training. This can be accomplished by over-sampling the minority classes and under-sampling the majority classes. However, in ML classification, finding a balancing criterion for re-sampling is challenging [27].

To overcome imbalance, we perform under-sampling to have a uniform distribution of the combination of the classes. The main advantage of under-sampling over other re-sampling methods is that it can also be applied to avoid overfitting caused by the high correlation between the neighboring frames of a laparoscopic video. Therefore, we try different under-sampling rates to find the smallest training set without sacrificing the performance.
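A minimal sketch of this combination-based under-sampling is shown below (NumPy); the function name, the per-combination cap, and the random seed are illustrative assumptions.

```python
import numpy as np

def undersample_by_combination(labels, per_class=400, seed=0):
    """labels: (N, K) binary tool-presence matrix. Returns indices of a subset
    with at most `per_class` frames for each tool combination (superclass)."""
    rng = np.random.default_rng(seed)
    # Encode each frame's tool combination as a single integer superclass id.
    combo_id = labels.astype(int).dot(1 << np.arange(labels.shape[1]))
    keep = []
    for c in np.unique(combo_id):
        idx = np.flatnonzero(combo_id == c)
        keep.append(rng.choice(idx, size=min(per_class, idx.size), replace=False))
    return np.sort(np.concatenate(keep))
```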

Figure 3 shows the relationship among the tools after re-sampling. It can be seen that the LP-based balancing method not only yields a near-uniform distribution in the superclass space but also improves the balance of the dataset in the single-class space (with the exception of the Grasper, which can be used together with all the other tools).

Fig. 3 The chord diagram for the relationship between the tools before and after balancing based on the tools' co-occurrences

Training

We train the model to simultaneously identify the presence of each tool and the tool combinations. Given the vector of confidence scores P, the ML loss \({L}_{f}\) is the sigmoid cross-entropy (CE), and the Softmax CE loss \({L}_{g}\) is used for training the decision model. We use a joint training paradigm to optimize the ML and MC losses in a multitask-learning approach.

The trainable weights for the ML optimizer are all the weights in the CNN, the RNN, and FC1. For the MC optimizer, the CNN, RNN, and FC2 are trainable. Note that the weights shared between the two optimizers are the RCNN weights. By keeping the FC1 layer untouched by the MC optimizer, the RCNN extracts spatio-temporal features that reflect both the presence of each tool and their combinations, while FC2 is trained solely as a decision model.
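A simplified sketch of this joint update, using the `rcnn` and `decision` models sketched above, is given below; the optimizers, learning rates, and the name-based filtering of FC1 are assumptions for illustration only.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()              # sigmoid CE -> L_f
scce = tf.keras.losses.SparseCategoricalCrossentropy()  # softmax CE -> L_g
opt_ml = tf.keras.optimizers.SGD(1e-3)
opt_mc = tf.keras.optimizers.SGD(1e-3)

def train_step(frames, tool_labels, combo_labels):
    with tf.GradientTape(persistent=True) as tape:
        p = rcnn(frames, training=True)        # ML confidence scores
        q = decision(p, training=True)         # superclass probabilities
        loss_f = bce(tool_labels, p)           # multilabel loss L_f
        loss_g = scce(combo_labels, q)         # multiclass (LP) loss L_g
    # ML loss updates the whole RCNN (CNN + GRU + FC1); MC loss updates the
    # RCNN without FC1, plus FC2, so FC1 stays untouched by the MC optimizer.
    ml_vars = rcnn.trainable_variables
    mc_vars = [v for v in rcnn.trainable_variables if "fc1" not in v.name]
    mc_vars += decision.trainable_variables
    grads_f = tape.gradient(loss_f, ml_vars)
    grads_g = tape.gradient(loss_g, mc_vars)
    del tape
    opt_ml.apply_gradients(zip(grads_f, ml_vars))
    opt_mc.apply_gradients(zip(grads_g, mc_vars))
```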

Post-processing

To smooth the RCNN predictions and consider the long-term ordering of the tools, we model the order of tool usage with an RNN over all the frames of each video [28]. Due to memory constraints, the final predictions of the RCNN are selected as the input for the post-processing RNN.

In the online mode, only the past frames are available for classifying the current frame. In the offline mode, future frames can also be used along with past frames to improve the classification results of the current frame. To accomplish this, a bi-directional RNN is employed. The post-processing RNN is a two-layer GRU with 128 and 32 units in each layer.

The post-processing method described in this section is similar to [22] in extracting long-term temporal features using RNNs. However, in contrast to that work, we use the final predictions of the RCNN model instead of the vector of confidence scores of the tools. Besides already encoding the co-occurrence information, a single scalar per frame makes the RNN easier to train than a vector whose size is the total number of tools or tool combinations. With this shorter input, we were able to train on longer sequences, even after performing the temporal data augmentation (described later).
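The following is an illustrative sketch of such an offline post-processing RNN with the two-layer 128/32-unit GRU mentioned above; feeding the class id as a single scalar per time step and the softmax output head are our assumptions.

```python
import tensorflow as tf

NUM_SUPERCLASSES = 15

# Offline post-processing: bi-directional two-layer GRU (128 and 32 units) over
# the whole video's sequence of RCNN predictions (one scalar class id per step).
preds = tf.keras.Input(shape=(None, 1))
x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128, return_sequences=True))(preds)
x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32, return_sequences=True))(x)
smoothed = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(NUM_SUPERCLASSES, activation="softmax"))(x)
post_rnn = tf.keras.Model(preds, smoothed)   # the online mode would drop Bidirectional
```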

Results

In this section, the performance of the different parts of the proposed tool detection model is validated through numerous experiments using the appropriate metrics. We used TensorFlow [29] for all of the experiments. The CNN in all the experiments was Inception-V1 [30]. For better generalization, extensive data augmentation, such as random cropping, horizontal and vertical flipping, rotation, and random changes in brightness, contrast, saturation, and hue, was performed during training. The initial learning rate was 0.001 with a decay rate of 0.7 after 5 epochs, and the results were taken after 100 epochs. The batch size was 32 for training the CNN models and 40 for the RNN-based models. All the experiments were conducted using an Nvidia TITAN Xp GPU.
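As an illustration, the augmentation and learning-rate schedule could be expressed as follows in TensorFlow; the crop size, the photometric ranges, the 90-degree rotation steps, and the reading of the decay as a staircase applied every 5 epochs are placeholders for settings the text does not spell out.

```python
import numpy as np
import tensorflow as tf

def augment(image):
    """Spatial and photometric augmentation along the lines described above."""
    image = tf.image.random_crop(image, size=(200, 200, 3))    # random cropping
    image = tf.image.resize(image, (224, 224))
    image = tf.image.random_flip_left_right(image)             # horizontal flip
    image = tf.image.random_flip_up_down(image)                # vertical flip
    image = tf.image.rot90(image, k=np.random.randint(4))      # rotation (90-degree steps)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, 0.8, 1.2)
    image = tf.image.random_saturation(image, 0.8, 1.2)
    image = tf.image.random_hue(image, max_delta=0.05)
    return image

STEPS_PER_EPOCH = 6000 // 32   # e.g., the balanced M2CAI set with batch size 32
# Initial learning rate 0.001, multiplied by 0.7 every 5 epochs.
lr = tf.keras.optimizers.schedules.ExponentialDecay(
    0.001, decay_steps=5 * STEPS_PER_EPOCH, decay_rate=0.7, staircase=True)
```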

LapTool-Net results on M2CAI dataset

Since the dataset was labeled for only one frame per second (out of 25 frames/s), it was possible to use the unlabeled frames for training, as long as the tools remained the same between two consecutive labeled frames. We used these unlabeled data to balance the training set according to the label power-sets.

To balance the dataset, 15 superclasses were selected and the original frames were re-sampled to obtain a uniform distribution. For each superclass, 400 frames were randomly selected, forming a training set of 6000 frames. In other words, under-sampling was performed based on the tool combinations.

We tested the model before and after adding the decision model. For training the RCNN model, we used 5 frames at a time (the current frame and the 4 previous frames) with an inter-frame interval of 5, which resulted in a total distance of 20 frames between the first and last frames. The RCNN model was trained with a Stochastic Gradient Descent (SGD) optimizer. The data augmentation for the post-processing model includes adding random noise to the input and randomly dropping frames to change the duration of the sequences; the final predictions of the RCNN model are saved every 20 frames, and frames are dropped with a probability of 10–30%. Table 1 shows the results of the proposed RCNN and LapTool-Net.
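Before turning to Table 1, a compact sketch of this temporal augmentation is given below; the noise scale and the exact drop mechanism are illustrative guesses consistent with the 10–30% range stated above.

```python
import numpy as np

def augment_sequence(pred_ids, drop_prob=0.2, noise_std=0.1, seed=None):
    """Temporal augmentation for the post-processing RNN input: randomly drop
    frames (probability 10-30%) to vary the duration and add small random noise."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(pred_ids)) > drop_prob                # drop frames at random
    seq = np.asarray(pred_ids, dtype=np.float32)[keep]
    return seq + rng.normal(0.0, noise_std, size=seq.shape)     # additive noise
```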

Table 1 Final results for the proposed model on M2CAI dataset

In the table, CNN represents the model that uses only still images, CNN-LP shows the results after considering the tool combinations in still images, RCNN considers spatio-temporal features from several successive frames, and LapTool-Net represents the performance of the model after considering the long-term ordering of tool usage.

It can be seen that by considering the temporal features through the RCNN model, the exact match accuracy and F1-macro were improved by 3.15% and 7.52%, respectively. Also, the F1-macro improved by 2.94% after adding the LP decision model.
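For reference, the two metrics reported here follow their standard multilabel definitions; a small sketch (using scikit-learn for the F1 score) is shown below.

```python
import numpy as np
from sklearn.metrics import f1_score

def exact_match_accuracy(y_true, y_pred):
    """Fraction of frames whose full set of tools is predicted exactly."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-tool F1 scores (binary indicator matrices)."""
    return f1_score(y_true, y_pred, average="macro")
```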

The higher performance of the LapTool-Net, shown in Table 1, is due to consideration of the long-term order of the usage of the tools. In the offline mode, the utilization of the frames from both the past and the future of the current frame causes the improvements over the online model in accuracy and F1-scores.

To check the effectiveness of the multitask approach used for the end-to-end training of the RCNN-LP model, we took the output of the ML classifier after removing the decision model from the trained RCNN-LP. In other words, we replaced the LP-based decision layer of the trained model with the threshold-based decision method. The results are shown in Table 2. It is worth mentioning that these results show that the RCNN model without the LP decision can still make predictions for all combinations, including the rare combinations that were originally excluded during training.

Table 2 The precision, recall, and F1-score of each tool for the ML classifier in RCNN-LP after removing the decision model

In order to localize the predicted tools, the attention maps were visualized using the Grad-CAM method [31]. The results for some of the frames are shown in Fig. 4. To avoid confusion in frames that contain multiple tools, only the class activation map of a single tool is shown, based on the prediction of the model. The results show that the visualization of the attention of the proposed model can also be used to reliably identify the location of each tool without any additional annotations for the location and shape of the tools.
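For completeness, a minimal Grad-CAM sketch is shown below for a single-frame Keras model; the layer name `last_conv` is a placeholder, and the exact visualization pipeline used for Fig. 4 may differ.

```python
import tensorflow as tf

def grad_cam(model, image, tool_index, conv_layer="last_conv"):
    """Weight the last conv feature maps by the pooled gradients of the chosen
    tool's confidence score and keep the positive contributions (Grad-CAM)."""
    cam_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = cam_model(image[None, ...])
        score = preds[0, tool_index]              # confidence of the chosen tool
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average-pool the gradients
    cam = tf.nn.relu(tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1))[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```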

Fig. 4 The visualization of the class activation maps for some examples, based on the prediction of the model

Comparison with current work

To validate the proposed model, we compared it with previously published research on the M2CAI dataset. The results are shown in Table 3. Our model outperformed previous methods by a significant margin, even though we chose a relatively shallow model (Inception-V1) and used less than 25% of the labeled images.

Table 3 Comparison of tool presence detection methods on M2CAI

It is worth mentioning that a perfectly fair comparison with previous work on the same dataset is not feasible, since the evaluation metrics might not be the same. Nevertheless, we compared our ML classifier f, which is the RCNN model, along with the final models, to show the superiority of our balancing and temporal modeling methods. Regardless of the choice of the CNN architecture, which is the component that most affects the results, the superiority of our model over the works in Table 3 is due to the end-to-end temporal modeling and the inclusion of context such as co-occurrence and task ordering, which are the main contributions of this paper.

LapTool-Net results on Cholec80 dataset

In this section, the performance of our model is evaluated on a larger dataset of laparoscopic cholecystectomy videos called Cholec80. We used the first 40 videos for training and the remaining 40 videos for testing our model.

The total number of tool combinations present in the Cholec80 dataset is 32, of which 20 combinations cover over 99.5% of the duration of the videos. Compared with the M2CAI dataset, the higher number of tool combinations is due to the greater diversity of the larger dataset. Nonetheless, the extra five superclasses in the Cholec80 dataset account for less than 0.4% of all frames. For each of the 20 tool combinations, 1500 samples were selected, forming a uniform class distribution over 30 K frames.

We used the same model as for the M2CAI dataset for extracting the spatio-temporal features, the decision policy, and the post-processing step, as well as the same training strategy. The results for the different parts of the model are shown in Table 4. Compared with the M2CAI results in Table 1, we can see a significant improvement in accuracy and F1-scores. For example, the F1-macro of the CNN on the balanced Cholec80 is 9.19% higher than on the M2CAI dataset.

Table 4 Final results for the proposed model on Cholec80 dataset

As expected, the accuracy and F1-scores increase after adding the LP-based decision layer. However, the improvements are relatively smaller compared with the M2CAI results. For instance, the F1-macro of the RCNN-LP is less than one percent higher than that of the RCNN. Similarly, the increase in the F1-macro from the CNN to the RCNN is smaller than on the M2CAI dataset (less than 5% versus over 10%). The likely reason is that, while the end-to-end training of the CNN, RNN, and LP layer yields richer discriminative features by considering co-occurrence and temporal coherence, the overall performance is dominated and bounded by the capacity of the CNN.

Discussion

In this paper, we proposed a novel system called LapTool-Net for automatically detecting the presence of tools in every frame of a laparoscopic video. The main feature of the proposed RCNN model is its context awareness, i.e., the model learns the short-term and long-term patterns of tool usage by utilizing the correlation of the tools with each other and with the surgical steps. Our method outperformed all previously published results on the M2CAI dataset, while using less than 1% of the total frames in the training set.

While our model is designed based on prior knowledge of the cholecystectomy procedure, it does not require any domain-specific knowledge from experts and can be effectively applied to videos captured from laparoscopic or even other forms of surgery. Also, the relatively small training set after under-sampling suggests that the labeling process could be accomplished faster by using fewer frames (e.g., one frame every 5 s). Moreover, the simple architecture of the proposed LP-based classifier makes it easy to combine with other proposed models such as [22] and [21], or with weakly supervised models [32, 33] to localize the tools in the frames. To accomplish that, the threshold mechanism of the ML classifier in all these papers can simply be replaced by our combination-aware decision model.