
1 Introduction

Surgical workflow analysis using a computer-assisted intervention (CAI) system based on machine learning or deep learning has been extensively studied [1,2,3,4,5,6,7,8,9,10]. In particular, surgical phase recognition can help optimize surgery by facilitating communication between surgeons and staff, supporting not only smooth teamwork but also efficient use of resources throughout the entire surgical procedure [11]. Moreover, it is valuable for post-operative patient monitoring and for educational materials through the classification of stereotyped surgical procedures [1]. However, phase recognition is a challenging task that involves many interactions between the actions of tools and organs. In addition, surgical video analysis has limitations such as video quality (i.e., occlusion and illumination changes) and unclear annotations at event boundaries [2, 3].

Many studies on surgical workflow analysis are limited because they rely only on CNN-based visual features and information about the presence of tools in the video. In this paper, to overcome this limitation, we introduce a visual modality-based multimodal fusion method that improves phase recognition performance by using interactions between the recognized tools. The proposed method extracts indices related to the tools used in surgery and fuses them with visual features extracted from a CNN. We demonstrate the effectiveness of the proposed tool-related indices for improving performance on a VR simulator-based dataset and a collected gastrectomy dataset.

We have the following contributions:

  • We propose a method to extract a visual kinematics-based index, describing the tools and helpful for surgical workflow analysis, from a visual modality such as a semantic segmentation map.

  • In addition, we show that the method can be applied in environments where, unlike robotic surgery, it is difficult to extract tool kinematics from the system.

  • We propose a fusion method that improves recognition performance by effectively aggregating the visual kinematics-based index and visual features.

2 Related Works

Phase Recognition. Early machine learning-based research conducted statistical analyses of temporal information using Hidden Markov Models (HMMs) and Dynamic Time Warping (DTW) [4]. As deep learning became more widely used, EndoNet [5], which recognizes tool presence through CNN-based feature extraction, was studied, and MTRCNet-CL [6], which combines a CNN and an LSTM to perform multiple tasks, was also proposed. Furthermore, a multi-stage TCN (MS-TCN)-based surgical workflow analysis study that performs hierarchical processing using temporal convolutions was conducted [10]; each stage was designed to refine the predictions of the previous stage to return more accurate predictions. Previous studies used only video information, or additionally only the presence of tools in the video, for analysis. In contrast, the proposed method fuses visual features with tool-related indices.

Surgical Workflow Dataset. Datasets published for surgical workflow recognition include actual surgical videos such as Cholec80 [5], toy samples for simple action recognition such as JIGSAWS [12] and MISAW [13], and synthetic data generated from VR simulators such as PETRAW [14]. JIGSAWS and MISAW provide kinematic information of the instruments from the master-slave robotic platform, so more precise tool movements can be analyzed. However, in laparoscopic surgery, kinematic information is difficult to use owing to the absence of a surgical robot, and security restrictions on robotic surgery devices limit the extraction and application of actual kinematic information. To address these problems, we generate tool-related indices from a visual modality to replace kinematic information.

Multimodal Learning. The various modalities (i.e., video and kinematics) created in the surgical environment carry different information about the surgical workflow. Multimodal learning aims to improve performance by using the mutual information between modalities. However, research on multimodal learning in surgical workflow analysis is still limited [5, 12,13,14,15], in particular because data such as the kinematics of surgical tools are difficult to access or extract. We propose a method that effectively improves performance by fusing various kinds of information generated from visual modalities in virtual or real data.

3 Methods

Fig. 1. Proposed visual modality-based multimodal fusion method. The visual kinematics-based index and the frame sequence extracted from the input frame sequence are used as inputs to the per-modality models. The feature representations of each modality are used as input to the fusion model for joint training.

In this section, we propose a method for extracting a visual kinematics-based index and a visual modality-based multimodal feature fusion method. We use two visual modalities: video and the visual kinematics-based index. The visual kinematics-based index expresses the movement of and relationships between the surgical tools extracted from the semantic segmentation mask. To improve phase recognition performance, we apply convolutional feature fusion to enhance the interaction of features extracted from the visual modalities. The overall learning structure is shown in Fig. 1.

3.1 Visual Kinematics-based Index

A visual kinematics-based index is defined as an index expressing the relationships between tools and the movement of tools. These indices help to understand the impact of tool actions during surgical procedures. Indeed, in previous studies, surgical instrument indices, including kinematics extracted from a surgical robot or from video, were used to analyze the skill level of the surgeon for all or part of the operation [15,16,17,18,19,20,21]. However, indices such as kinematics are extracted from the robot system and are hard to access. To solve this problem, we extract the visual kinematics-based index by recognizing the tools from the semantic segmentation mask.

Types of Visual Kinematics-based Index. The visual kinematics-based index consists of two types: movement indices and relation indices between tools. The movement index comprises {path length, velocity, centroids, speed, bounding box, economy of area} [21] and is computed as follows:

$$\begin{aligned} PL=\sum _t^T \sqrt{(D(x,t))^2 + (D(y,t))^2},\quad D(x,t)=x_{t}-x_{t-1}. \end{aligned}$$
(1)
$$\begin{aligned} s=\frac{PL}{T},\quad v(x)=\frac{x_t-x_{t-\varDelta }}{\varDelta }. \end{aligned}$$
(2)
$$\begin{aligned} EOA=\frac{bw \times bh}{W \times H}. \end{aligned}$$
(3)

where PL is the path length at the current time frame t and T is the time range over which the index is computed. The path length has two variants: cumulative path length and partial path length. D(x, t) measures the difference in the x coordinate between the previous and current time frames. x and y denote the centroids of an object in the frame; centroids are the average positional values along the X and Y coordinates in the semantic segmentation mask. s is the speed over the time range T, and v is the velocity along the X or Y direction at time interval \(\varDelta \). bw and bh are the width and height of the bounding box, and W and H are the width and height of the image. The bounding box (BBox) consists of four values: top, left, box width, and box height (bx, by, bw, bh).
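As a minimal illustration, the sketch below computes the movement indices of Eqs. (1)-(3) with NumPy from per-frame centroids and bounding boxes derived from the segmentation mask. The function name, array layout, and the choice \(\varDelta =1\) are our assumptions and are not taken from the paper.

```python
import numpy as np

def movement_indices(centroids, bboxes, frame_size, T):
    """Illustrative sketch of the movement indices in Eqs. (1)-(3).

    centroids:  (T, 2) array of per-frame (x, y) tool centroids from the mask.
    bboxes:     (T, 4) array of per-frame bounding boxes (bx, by, bw, bh).
    frame_size: (W, H) of the image.
    T:          time range used to compute the indices.
    """
    diffs = np.diff(centroids, axis=0)                 # D(x, t), D(y, t)
    step_lengths = np.sqrt((diffs ** 2).sum(axis=1))   # per-frame displacement
    path_length = step_lengths.sum()                   # PL, Eq. (1)

    speed = path_length / T                            # s = PL / T, Eq. (2)
    delta = 1                                          # time interval (assumed 1 frame)
    velocity = (centroids[-1] - centroids[-1 - delta]) / delta  # v for X and Y

    W, H = frame_size
    bw, bh = bboxes[-1, 2], bboxes[-1, 3]
    eoa = (bw * bh) / (W * H)                          # economy of area, Eq. (3)

    return {"path_length": path_length, "speed": speed,
            "velocity": velocity, "eoa": eoa}
```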

The relation index was measured as {IoU, gIoU, cIoU, dIoU} [21,22,23], where gIoU, cIoU, and dIoU are modified versions of IoU. The IoU family of indices reflects how close two objects are to each other. We considered \(\{\lambda _1,...,\lambda _N\}\), where \(\lambda \) denotes a visual kinematics-based index, to train the phase recognition model through index combination experiments.
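For reference, the relation indices are standard box-overlap measures. The sketch below computes IoU and gIoU for two boxes; the (x1, y1, x2, y2) box format and the function name are assumptions for illustration, and dIoU/cIoU extend the same pattern with additional center-distance and aspect-ratio terms.

```python
def iou_giou(box_a, box_b):
    """Compute IoU and gIoU between two boxes given as (x1, y1, x2, y2)."""
    # intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # smallest enclosing box C used by gIoU
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area

    return iou, giou
```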

3.2 Feature Fusion

The feature representation for each modality has different information regarding surgical workflow. The representation extracted from the video is related to the overall action in the scene, and the representation extracted from the visual index is related to the detailed movement of each tool. We designed a convolution-based feature fusion module for the interaction of representations to improve recognition performance. For performance comparison, a simple linear feature fusion method and a convolution-based feature fusion method were introduced.

Linear Feature Fusion. For the feature representation of each modality, the linear fusion module is defined as follows:

$$\begin{aligned} f_i^m=\eta (\theta _m(x_i^m)), \quad m \in \{V, VKI\}. \end{aligned}$$
(4)
$$\begin{aligned} z_i = \psi ( \textrm{concat}( f_i^V, f_i^{VKI} ) ). \end{aligned}$$
(5)

where \(f_i^m\) is a d-dimensional projected feature for each modality, \(x_i^m\) is the i-th input of modality m, and \(\theta _m\) is a deep neural network-based recognition model for each modality. V and VKI denote video and the visual kinematics-based index, respectively. \(\eta \) and \(\psi \) are fusion blocks based on Multi-Layer Perceptron (MLP) layers that generate features of another view and aggregate features, respectively. The concatenated feature is aggregated into a d-dimensional feature \(z_i\), which serves as the input to the classification layer.
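A minimal PyTorch sketch of the linear fusion in Eqs. (4)-(5) is given below; the layer widths, activations, and class names are our assumptions, with \(\eta \) and \(\psi \) realized as small MLPs.

```python
import torch
import torch.nn as nn

class LinearFusion(nn.Module):
    """Minimal sketch of the linear fusion of Eqs. (4)-(5); sizes are assumptions."""

    def __init__(self, dim_v, dim_vki, d, num_classes):
        super().__init__()
        # eta: per-modality MLP projecting backbone features to d dimensions (Eq. 4)
        self.eta_v = nn.Sequential(nn.Linear(dim_v, d), nn.ReLU(), nn.Linear(d, d))
        self.eta_vki = nn.Sequential(nn.Linear(dim_vki, d), nn.ReLU(), nn.Linear(d, d))
        # psi: MLP aggregating the concatenated features back to d dimensions (Eq. 5)
        self.psi = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.classifier = nn.Linear(d, num_classes)

    def forward(self, feat_v, feat_vki):
        # feat_v / feat_vki: backbone outputs theta_V(x_i^V) and theta_VKI(x_i^VKI)
        f_v = self.eta_v(feat_v)        # f_i^V,   Eq. (4)
        f_vki = self.eta_vki(feat_vki)  # f_i^VKI, Eq. (4)
        z = self.psi(torch.cat([f_v, f_vki], dim=-1))  # z_i, Eq. (5)
        return self.classifier(z)
```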

Convolution Based Feature Fusion. The linear fusion module is not an effective approach because it is a simple late-fusion method based on a vanilla fully-connected layer. The proposed convolution-based feature fusion module is more effective at enhancing the interaction between features for phase recognition. The proposed method proceeds in two steps: 1) stop gradient-based representation enhancement and 2) convolutional feature aggregation, as shown in Fig. 2.

Fig. 2. An illustration of the convolution-based feature fusion module. Before feature fusion, the feature representations are enhanced by a stop-gradient strategy. Then, the features are aggregated by a 1D convolutional operation.

$$\begin{aligned} g_i^m = \phi (f_i^m) \end{aligned}$$
(6)

We apply the stop gradient-based approach proposed in [24] to bring the representations of the modalities, which have different views, closer together and to speed up learning convergence. \(g_i^m\), which has the same dimension but a different view, is generated through an MLP projector in Eq. 6. [24] used a contrastive loss to learn the similarity between representations, which is defined as:

$$\begin{aligned} \mathcal {D}(a_i, b_i) = (\sum _{j=1}^d {|a_{i,j}-b_{i,j}|^p})^{1/p} \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{con}(f_i^{m_1}, g_i^{m_2})=\frac{1}{2}\mathcal {D}(f_i^{m_1}, \textrm{stopgrad}(g_i^{m_2})) + \frac{1}{2}\mathcal {D}(\textrm{stopgrad}(f_i^{m_1}), g_i^{m_2}) \end{aligned}$$
(8)

where \(a_i\) and \(b_i\) are feature representations of different views, p is the order of the norm, and \(m_1\) and \(m_2\) are each one of \(\{V, VKI\}\). Unlike [24], the similarity is calculated using the pairwise distance in our experiments. The fused feature representation \(z_i\) is obtained by the convolution-based feature fusion as follows:

$$\begin{aligned} z_i = \varTheta ( \textrm{concat}( g_i^V, g_i^{VKI} ) ) \end{aligned}$$
(9)

where \(\varTheta \) is a 1D convolution-based feature fusion block with kernel size k, and \(z_i\) is used as the input of the classifier h to predict \(\hat{y}\). The recognition loss \(\mathcal {L}_{cls}\) is computed by the cross-entropy loss, and the total loss is defined as in Eq. 11.

$$\begin{aligned} \mathcal {L}_{cls}=\textrm{CrossEntropyLoss}(\hat{y}, y), \quad \hat{y} = h(z_i) \end{aligned}$$
(10)
$$\begin{aligned} \mathcal {L}_{total}=\mathcal {L}_{con}+\mathcal {L}_{cls} \end{aligned}$$
(11)

4 Experiment Results

4.1 Base Setting

Dataset. We validated the proposed methods using two different datasets. 1) PETRAW [14] was released as a challenge at MICCAI 2021. The PETRAW dataset consists of paired video, arm kinematics, and semantic segmentation masks generated from a VR simulator. The training and test data comprise 90 and 60 pairs, respectively. PETRAW has four tasks, Phase (3), Step (13), Left action (7), and Right action (7); the values in parentheses are the numbers of classes. 2) Forty gastrectomy surgery videos, referred to as G40, were collected with da Vinci Si and Xi devices between January 2018 and December 2019. We constructed a 30:10 training and evaluation split considering the patients' demographic data such as {age, gender, pre_BMI, OP_time, Blood_loss, and length of surgery}. Following [3], the G40 dataset was annotated with 27 ARMES-based surgical phases by consensus of three surgeons. G40 consists of video and semantic segmentation masks with 31 classes, including tools and organs: {harmonic ace, bipolar forceps, cadiere forceps, grasper, stapler, clip applier, suction irrigation, needle, gauze, specimen bag, drain tube, liver, stomach, pancreas, spleen, and gallbladder}. Each instrument consists of head, wrist, and body parts.

Model. To train models for the various modalities, we used SlowFast50 [25] with \(\alpha \), \(\beta \), and \(\tau \) for video and a Bi-LSTM [26] for kinematics and the visual kinematics-based index. The segmentation model was trained to predict semantic segmentation masks for generating the index; we used UperNet [27] with a Swin Transformer [28] backbone.

Evaluation Metrics. To compare phase recognition results, we used several evaluation metrics: the accuracy over all correctly classified samples and the class-averaged recall, precision, and F1-score for each task. All metrics were computed frame-by-frame. In all tables, we selected the best models by the average F1-score over tasks.
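For reference, a minimal sketch of such frame-wise metrics with scikit-learn is shown below; the macro-averaging convention and the function name are our assumptions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def frame_metrics(y_true, y_pred):
    """Frame-wise accuracy and class-averaged precision/recall/F1 (assumed macro average)."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "mPrecision": prec, "mRecall": rec, "mF1": f1}
```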

4.2 Performance Analysis

Table 1. Best combination experiments for the visual kinematics-based index on PETRAW. \(\{\lambda _1,...,\lambda _N\}\) are, in order, cumulative path length (1), partial path length (2), velocity (3), speed (4), EOA (5), centroids (6), IoU (7), gIoU (8), dIoU (9), and cIoU (10). The best combination is selected by mF1-score.

Important Feature Selection. We extracted various visual kinematics-based indices and then evaluated on PETRAW which index combinations positively affected performance (Table 1). \(\lambda _1\) and \(\lambda _2\) contributed to performance improvement in all cases, and \(\lambda _3\) also had a significant effect on performance. Figure 3 shows the cumulative counts of each index over the best- and worst-performing combinations. In the best combinations, \(\{\lambda _1, \lambda _2, \lambda _3, \lambda _6\}\) were used most often, but \(\lambda _6\) was also involved in the worst performance. We used \(N=5\) because that combination achieved the best performance. The bounding box index was included in all combination experiments because it contributes to performance improvement, as shown in Table 2. The bounding box can create synergy with other indices because it carries both positional information (bx, by) and object-size information (bw, bh). All indices combined with the bounding box obtained better performance than those without it.

Fig. 3. Histograms of the visual kinematics-based index for the best and worst performance. (a) Cumulative counts of each index over the best-performing combinations. (b) Cumulative counts of each index over the worst-performing combinations.

Table 2. Evaluation of the impact of the bounding box. Each row is the performance using a single index. The value in parentheses is the improvement from adding the bounding box, and bold indicates the most significant improvement.

Performance on PETRAW. We used an Adam optimizer with an initial learning rate of 1e-3 and an L2 weight decay of 1e-5, with a step scheduler for the Bi-LSTM and the convolution-based fusion method, and a cosine annealing scheduler with warmup over 34 epochs for SlowFast and the linear fusion method. A batch size of 128 was used in all experimental environments. For the step scheduler, the learning rate was decayed by a factor of 0.9 every five epochs. Following [25], \(\alpha \), \(\beta \), and \(\tau \) were set to \(\{4, 8, 4\}\) in SlowFast. The hidden layer size and output dimension of the Bi-LSTM were both set to 256. The projected feature size d was set to 512 for both fusion modules, and the convolution kernel size k was 3. To address data imbalance, all networks used the class-balanced loss [29] and were trained for 50 epochs. The training and test datasets were subsampled at 5 fps. The clip size was 8, and the time range T was equal to the clip size.
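A minimal sketch of this optimization setup is shown below (the warmup phase is omitted for brevity, and the helper name is illustrative):

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

def build_optimizer(model, fusion="conv"):
    # Adam with the stated initial learning rate and L2 weight decay
    optimizer = Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    if fusion == "conv":
        # step scheduler: decay by 0.9 every five epochs (Bi-LSTM / conv fusion)
        scheduler = StepLR(optimizer, step_size=5, gamma=0.9)
    else:
        # cosine annealing over 34 epochs (SlowFast / linear fusion); warmup omitted
        scheduler = CosineAnnealingLR(optimizer, T_max=34)
    return optimizer, scheduler
```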

Table 3 shows the mF1 performance of each modality on the PETRAW dataset. The baselines, including video and kinematics, were compared with the visual kinematics-based index. In particular, the phase and step performance of the visual kinematics-based index was similar to the kinematics-based performance. This verifies that the visual kinematics-based index can help recognize the actions of tools, as shown in Tables 1, 2, and 3. Furthermore, the proposed fusion technique achieved improved performance compared to the baselines. Our fusion methodology is useful for fusing the representations by enhancing the interactions between features.

Performance on G40. We used the same training settings as for PETRAW, except that the initial learning rate was set to 1e-2, a weighted cross-entropy loss was used for SlowFast, and a cosine annealing scheduler was used for all experiments. A batch size of 64 was used in all experimental environments, and all networks were trained for 50 epochs. The sampling rate was set to 1 fps for the training and test datasets. The clip size was 32, and the time range T was equal to the clip size. Using the visual kinematics-based index also improved performance on G40, as shown in Table 4. That is, the visual kinematics-based index can replace kinematics in actual surgery.

Table 3. Performance change for each modality on PETRAW. {V, K, VKI} denote video, kinematics, and the visual kinematics-based index.
Table 4. Performance change for each modality on G40. mPrecision, mRecall, and mF1 are the class-averaged results.

4.3 Ablation Study

Visual Kinematics Based Index for Organs. The surgical procedure involves interactions between tools and organs; therefore, relation indices between tools and organs can help recognition performance. We evaluated the performance change when including a relation index between tools and organs, using \(\lambda _8\) and \(\lambda _{10}\) measured between tools and organs to capture this relationship. The comparison is shown in Table 5. The improved performance validates that these indices help recognize the surgical procedure.

Table 5. Comparative results when including organ indices on G40. We compare adding the relation index between tools and organs, including the liver, stomach, pancreas, spleen, and gallbladder.

Change of Semantic Model. We evaluated the change in performance with respect to the segmentation model. We considered three models: DeeplabV3+ [30], UperNet [27], and OCRNet [31]. UperNet used a Swin Transformer [28] backbone, and OCRNet used HRNet [32]. We used the default settings of MMSegmentation [33] to train the models for 100 and 300 epochs on PETRAW and G40, respectively. More accurate segmentation results led to improved recognition performance, as shown in Tables 6 and 7.

Table 6. Performance change for various segmentation models on PETRAW. The values in table are mF1-score for each task.
Table 7. Performance change for various segmentation models on G40.

5 Conclusion

We proposed a visual modality-based feature fusion method for recognizing surgical procedures. We extracted a visual kinematics-based index from a visual modality such as a semantic segmentation map and trained the model using these indices together with the visual features from a CNN. We validated that our approach helps recognize the surgical procedure in a simple simulation (PETRAW) and in actual surgery (G40). In addition, the visual kinematics-based index is expected to be helpful in non-robotic surgery, such as laparoscopic surgery, since it is generated from a visual modality. In future work, we will consider extracting the visual kinematics-based index from other visual modalities, such as object detection models.