
1 Introduction

Surgical workflow analysis using a computer-assisted intervention (CAI) system based on machine learning or deep learning has been extensively studied [1,2,3,4,5,6,7,8,9,10]. In particular, surgical phase recognition can help optimize surgery by facilitating communication between surgeons and staff, supporting not only smooth teamwork but also efficient use of resources throughout the entire surgical procedure [11]. Moreover, it is valuable for post-operative patient monitoring and for educational materials through the classification of stereotyped surgical procedures [1]. However, phase recognition is a challenging task that involves many interactions between the actions of tools and organs. In addition, surgical video analysis has limitations such as video quality (i.e., occlusion and illumination changes) and unclear annotations at event boundaries [2, 3].

Many studies on surgical workflow analysis are limited because they rely only on CNN-based visual features and information about the presence of tools in the video. In this paper, to overcome this limitation, we introduce a visual modality-based multimodal fusion method that improves phase recognition performance by using interactions between the recognized tools. The proposed method extracts indices related to the tools used in surgery and fuses them with visual features extracted from a CNN. We demonstrate the effectiveness of the proposed tool-related indices for improving performance on a VR simulator-based dataset and a collected gastrectomy dataset.

We have the following contributions:

  • We propose a method to extract a visual kinematics-based index, describing the tools and helpful for surgical workflow analysis, from a visual modality such as a semantic segmentation map.

  • In addition, we show that the method can be applied in environments where, unlike robotic surgery, it is difficult to extract tool kinematics from the system.

  • We propose a fusion method that improves recognition performance by effectively aggregating the visual kinematics-based index and visual features.

2 Related Works

Phase Recognition. Early machine learning-based research conducted statistical analyses of temporal information using Hidden Markov Models (HMMs) and Dynamic Time Warping (DTW) [4]. As deep learning became more widely used, EndoNet [5], which recognizes tool presence through CNN-based feature extraction, was studied, and MTRCNet-CL [6], which combines a CNN and an LSTM to perform multiple tasks, was also proposed. Furthermore, a multi-stage TCN (MS-TCN)-based surgical workflow analysis study that performs hierarchical processing using temporal convolutions was conducted [10]; each stage was designed to refine the predictions of the previous stage to return more accurate predictions. Previous studies used only video information, or additionally only the presence of tools in the video, for analysis. In contrast, the proposed method fuses visual features with tool-related indices.

Surgical Workflow Dataset. Datasets published for surgical workflow recognition include actual surgical videos such as Cholec80 [5], toy samples for simple action recognition such as JIGSAWS [12] and MISAW [13], and synthetic data generated from VR simulators such as PETRAW [14]. JIGSAWS and MISAW provide kinematic information of the instruments from the master-slave robotic platform, so more precise tool movements can be analyzed. However, in laparoscopic surgery, kinematic information is difficult to use owing to the absence of a surgical robot, and security restrictions on robotic surgery devices limit the extraction and application of actual kinematic information. To address these problems, we generate tool-related indices from a visual modality to replace kinematic information.

Multimodal Learning. The various modalities (i.e., video and kinematics) created in the surgical environment carry different information about the surgical workflow. Multimodal learning aims to improve performance by using the mutual information between modalities. However, research on multimodal learning in surgical workflow analysis is still limited [5, 12,13,14,15], in particular because data such as the kinematics of surgical tools are difficult to access or extract. We propose a method that effectively improves performance by fusing various kinds of information generated from visual modalities in virtual or real data.

3 Methods

Fig. 1. Proposed visual modality-based multimodal fusion method. The visual kinematics-based index and the frame sequence extracted from the input frame sequence are used as inputs to the per-modality models. The feature representations of each modality are used as input to the fusion model for joint training.

In this section, we propose a method for extracting a visual kinematics-based index and a visual modality-based multimodal feature fusion method. We use two visual modalities: video and the visual kinematics-based index. The visual kinematics-based index expresses the movement of and relationships between the surgical tools extracted from the semantic segmentation mask. To improve phase recognition performance, we apply convolutional feature fusion to enhance the interaction of features extracted from the visual modalities. The overall learning structure is shown in Fig. 1.

3.1 Visual Kinematics-based Index

A visual kinematics-based index is defined as an index expressing the relationships between tools and the movement of tools. These indices help to understand the impact of tool actions during surgical procedures. Indeed, in previous studies, surgical instrument indices, including kinematics extracted from a surgical robot or from video, were used to analyze the skill level of the surgeon for all or part of the operation [15,16,17,18,19,20,21]. However, indices such as kinematics are extracted from the robot system and are hard to access. To solve this problem, we extract the visual kinematics-based index by recognizing the tools from the semantic segmentation mask.

Types of Visual Kinematics-based Index. The visual kinematics-based index consists of two types: movement indices and relation indices between tools. The movement index comprises {path length, velocity, centroids, speed, bounding box, economy of area} [21] and is computed as follows:

$$\begin{aligned} PL=\sum _t^T \sqrt{(D(x,t))^2 + (D(y,t))^2},\quad D(x,t)=x_{t}-x_{t-1}. \end{aligned}$$
(1)
$$\begin{aligned} s=\frac{PL}{T},\quad v(x)=\frac{x_t-x_{t-\varDelta }}{\varDelta }. \end{aligned}$$
(2)
$$\begin{aligned} EOA=\frac{bw \times bh}{W \times H}. \end{aligned}$$
(3)

where PL is the path length at the current time frame t and T is the time range over which the index is computed. The path length has two variants: cumulative path length and partial path length. D(x, t) measures the difference in the x coordinate between the previous and current time frames. x and y denote the centroids of an object in the frame; centroids are the average positional values along the X and Y coordinates in the semantic segmentation mask. s is the speed over the time range T, and v is the velocity along the X or Y direction at time interval \(\varDelta \). bw and bh are the width and height of the bounding box, and W and H are the width and height of the image. The bounding box (BBox) consists of four values: top, left, box width, and box height (bx, by, bw, bh).
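As a minimal illustration, the sketch below computes the movement indices of Eqs. (1)-(3) with NumPy from per-frame centroids and bounding boxes derived from the segmentation mask. The function name, array layout, and the choice \(\varDelta =1\) are our assumptions and are not taken from the paper.

```python
import numpy as np

def movement_indices(centroids, bboxes, frame_size, T):
    """Illustrative sketch of the movement indices in Eqs. (1)-(3).

    centroids:  (T, 2) array of per-frame (x, y) tool centroids from the mask.
    bboxes:     (T, 4) array of per-frame bounding boxes (bx, by, bw, bh).
    frame_size: (W, H) of the image.
    T:          time range used to compute the indices.
    """
    diffs = np.diff(centroids, axis=0)                 # D(x, t), D(y, t)
    step_lengths = np.sqrt((diffs ** 2).sum(axis=1))   # per-frame displacement
    path_length = step_lengths.sum()                   # PL, Eq. (1)

    speed = path_length / T                            # s = PL / T, Eq. (2)
    delta = 1                                          # time interval (assumed 1 frame)
    velocity = (centroids[-1] - centroids[-1 - delta]) / delta  # v for X and Y

    W, H = frame_size
    bw, bh = bboxes[-1, 2], bboxes[-1, 3]
    eoa = (bw * bh) / (W * H)                          # economy of area, Eq. (3)

    return {"path_length": path_length, "speed": speed,
            "velocity": velocity, "eoa": eoa}
```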

The relation index was measured as {IoU, gIoU, cIoU, dIoU} [21,22,23], where gIoU, cIoU, and dIoU are modified versions of IoU. The IoU family of indices reflects how close two objects are to each other. We considered \(\{\lambda _1,...,\lambda _N\}\), where \(\lambda \) denotes a visual kinematics-based index, to train the phase recognition model through index combination experiments.
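For reference, the relation indices are standard box-overlap measures. The sketch below computes IoU and gIoU for two boxes; the (x1, y1, x2, y2) box format and the function name are assumptions for illustration, and dIoU/cIoU extend the same pattern with additional center-distance and aspect-ratio terms.

```python
def iou_giou(box_a, box_b):
    """Compute IoU and gIoU between two boxes given as (x1, y1, x2, y2)."""
    # intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # smallest enclosing box C used by gIoU
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (c_area - union) / c_area

    return iou, giou
```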

3.2 Feature Fusion

The feature representation for each modality has different information regarding surgical workflow. The representation extracted from the video is related to the overall action in the scene, and the representation extracted from the visual index is related to the detailed movement of each tool. We designed a convolution-based feature fusion module for the interaction of representations to improve recognition performance. For performance comparison, a simple linear feature fusion method and a convolution-based feature fusion method were introduced.

Linear Feature Fusion. For the feature representation of each modality, the linear fusion module is defined as follows:

$$\begin{aligned} f_i^m=\eta (\theta _m(x_i^m)), \quad m \in \{V, VKI\}. \end{aligned}$$
(4)
$$\begin{aligned} z_i = \psi ( \textrm{concat}( f_i^V, f_i^{VKI} ) ). \end{aligned}$$
(5)

where \(f_i^m\) is a d-dimensional projected feature for each modality, \(x_i^m\) is the i-th input of modality m, and \(\theta _m\) is a deep neural network-based recognition model for each modality. V and VKI denote video and the visual kinematics-based index, respectively. \(\eta \) and \(\psi \) are fusion blocks based on Multi-Layer Perceptron (MLP) layers that generate features of another view and aggregate features, respectively. The concatenated feature is aggregated into a d-dimensional feature \(z_i\), which serves as the input to the classification layer.
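A minimal PyTorch sketch of the linear fusion in Eqs. (4)-(5) is given below; the layer widths, activations, and class names are our assumptions, with \(\eta \) and \(\psi \) realized as small MLPs.

```python
import torch
import torch.nn as nn

class LinearFusion(nn.Module):
    """Minimal sketch of the linear fusion of Eqs. (4)-(5); sizes are assumptions."""

    def __init__(self, dim_v, dim_vki, d, num_classes):
        super().__init__()
        # eta: per-modality MLP projecting backbone features to d dimensions (Eq. 4)
        self.eta_v = nn.Sequential(nn.Linear(dim_v, d), nn.ReLU(), nn.Linear(d, d))
        self.eta_vki = nn.Sequential(nn.Linear(dim_vki, d), nn.ReLU(), nn.Linear(d, d))
        # psi: MLP aggregating the concatenated features back to d dimensions (Eq. 5)
        self.psi = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.classifier = nn.Linear(d, num_classes)

    def forward(self, feat_v, feat_vki):
        # feat_v / feat_vki: backbone outputs theta_V(x_i^V) and theta_VKI(x_i^VKI)
        f_v = self.eta_v(feat_v)        # f_i^V,   Eq. (4)
        f_vki = self.eta_vki(feat_vki)  # f_i^VKI, Eq. (4)
        z = self.psi(torch.cat([f_v, f_vki], dim=-1))  # z_i, Eq. (5)
        return self.classifier(z)
```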

Convolution Based Feature Fusion. The linear fusion module is not an effective approach because it is a simple late-fusion method based on a vanilla fully-connected layer. The proposed convolution-based feature fusion module is more effective at enhancing the interaction between features for phase recognition. The proposed method proceeds in two steps: 1) stop gradient-based representation enhancement and 2) convolutional feature aggregation, as shown in Fig. 2.

Fig. 2. An illustration of the convolution-based feature fusion module. Before feature fusion, the feature representations are enhanced by a stop-gradient strategy. Then, the features are aggregated by a 1D convolutional operation.

$$\begin{aligned} g_i^m = \phi (f_i^m) \end{aligned}$$
(6)

We apply the stop gradient-based approach proposed in [24] to bring the representations of the modalities, which have different views, closer together and to speed up learning convergence. \(g_i^m\), which has the same dimension but a different view, is generated through an MLP projector in Eq. 6. [24] used a contrastive loss to learn the similarity between representations, which is defined as:

$$\begin{aligned} \mathcal {D}(a_i, b_i) = (\sum _{j=1}^d {|a_{i,j}-b_{i,j}|^p})^{1/p} \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{con}(f_i^{m_1}, g_i^{m_2})=\frac{1}{2}\mathcal {D}(f_i^{m_1}, \textrm{stopgrad}(g_i^{m_2})) + \frac{1}{2}\mathcal {D}(\textrm{stopgrad}(f_i^{m_1}), g_i^{m_2}) \end{aligned}$$
(8)

where \(a_i\) and \(b_i\) are feature representations of different views, p is the order of the norm, and \(m_1\) and \(m_2\) are each one of \(\{V, VKI\}\). Unlike [24], the similarity is calculated using the pairwise distance in our experiments. The fused feature representation \(z_i\) is obtained by the convolution-based feature fusion as follows:

$$\begin{aligned} z_i = \varTheta ( \textrm{concat}( g_i^V, g_i^{VKI} ) ) \end{aligned}$$
(9)

where \(\varTheta \) is a 1D convolution-based feature fusion block with kernel size k, and \(z_i\) is used as the input of the classifier h to predict \(\hat{y}\). The recognition loss \(\mathcal {L}_{cls}\) is computed by the cross-entropy loss, and the total loss is defined as in Eq. 11.

$$\begin{aligned} \mathcal {L}_{cls}=\textrm{CrossEntropyLoss}(\hat{y}, y), \quad \hat{y} = h(z_i) \end{aligned}$$
(10)
$$\begin{aligned} \mathcal {L}_{total}=\mathcal {L}_{con}+\mathcal {L}_{cls} \end{aligned}$$
(11)

4 Experiment Results

4.1 Base Setting

Dataset. We validated the proposed methods using two different datasets. 1) PETRAW [14] was released as a challenge at MICCAI 2021. The PETRAW dataset consists of paired video, arm kinematics, and semantic segmentation masks generated from a VR simulator. The training and test data comprise 90 and 60 pairs, respectively. PETRAW has four tasks, Phase (3), Step (13), Left action (7), and Right action (7); the values in parentheses are the numbers of classes. 2) Forty gastrectomy surgery videos, referred to as G40, were collected with da Vinci Si and Xi devices between January 2018 and December 2019. We constructed a 30:10 training and evaluation split considering the patients' demographic data such as {age, gender, pre_BMI, OP_time, Blood_loss, and length of surgery}. Following [3], the G40 dataset was annotated with 27 ARMES-based surgical phases by consensus of three surgeons. G40 consists of video and semantic segmentation masks with 31 classes, including tools and organs: {harmonic ace, bipolar forceps, cadiere forceps, grasper, stapler, clip applier, suction irrigation, needle, gauze, specimen bag, drain tube, liver, stomach, pancreas, spleen, and gallbladder}. Each instrument consists of head, wrist, and body parts.

Model. To train models for the various modalities, we used SlowFast50 [25] with \(\alpha \), \(\beta \), and \(\tau \) for video and a Bi-LSTM [26] for kinematics and the visual kinematics-based index. The segmentation model was trained to predict semantic segmentation masks for generating the index; we used UperNet [27] with a Swin Transformer [28] backbone.

Evaluation Metrics. To compare phase recognition results, we used several evaluation metrics: the accuracy over all correctly classified samples and the class-averaged recall, precision, and F1-score for each task. All metrics were computed frame-by-frame. In all tables, we selected the best models by the average F1-score over tasks.
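For reference, a minimal sketch of such frame-wise metrics with scikit-learn is shown below; the macro-averaging convention and the function name are our assumptions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def frame_metrics(y_true, y_pred):
    """Frame-wise accuracy and class-averaged precision/recall/F1 (assumed macro average)."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "mPrecision": prec, "mRecall": rec, "mF1": f1}
```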

4.2 Performance Analysis

Table 1. Best combination experiments for the visual kinematics-based index on PETRAW. \(\{\lambda _1,...,\lambda _N\}\) are, in order, cumulative path length (1), partial path length (2), velocity (3), speed (4), EOA (5), centroids (6), IoU (7), gIoU (8), dIoU (9), and cIoU (10). The best combination is selected by mF1-score.

Important Feature Selection. We extracted various visual kinematics-based indices and then evaluated on PETRAW which index combinations positively affected performance (Table 1). \(\lambda _1\) and \(\lambda _2\) contributed to performance improvement in all cases, and \(\lambda _3\) also had a significant effect on performance. Figure 3 shows the cumulative counts of each index over the best- and worst-performing combinations. In the best combinations, \(\{\lambda _1, \lambda _2, \lambda _3, \lambda _6\}\) were used most often, but \(\lambda _6\) was also involved in the worst performance. We used \(N=5\) because that combination achieved the best performance. The bounding box index was included in all combination experiments because it contributes to performance improvement, as shown in Table 2. The bounding box can create synergy with other indices because it carries both positional information (bx, by) and object-size information (bw, bh). All indices combined with the bounding box obtained better performance than those without it.

Fig. 3. Histograms of the visual kinematics-based index for the best and worst performance. (a) Cumulative counts of each index over the best-performing combinations. (b) Cumulative counts of each index over the worst-performing combinations.

Table 2. Evaluation of the impact of the bounding box. Each row is the performance using a single index. The value in parentheses is the improvement from adding the bounding box, and bold indicates the most significant improvement.

Performance on PETRAW. We used an Adam optimizer with an initial learning rate of 1e-3 and an L2 weight decay of 1e-5, with a step scheduler for the Bi-LSTM and the convolution-based fusion method, and a cosine annealing scheduler with warmup over 34 epochs for SlowFast and the linear fusion method. A batch size of 128 was used in all experimental environments. For the step scheduler, the learning rate was decayed by a factor of 0.9 every five epochs. Following [25], \(\alpha \), \(\beta \), and \(\tau \) were set to \(\{4, 8, 4\}\) in SlowFast. The hidden layer size and output dimension of the Bi-LSTM were both set to 256. The projected feature size d was set to 512 for both fusion modules, and the convolution kernel size k was 3. To address data imbalance, all networks used the class-balanced loss [29] and were trained for 50 epochs. The training and test datasets were subsampled at 5 fps. The clip size was 8, and the time range T was equal to the clip size.
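A minimal sketch of this optimization setup is shown below (the warmup phase is omitted for brevity, and the helper name is illustrative):

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR

def build_optimizer(model, fusion="conv"):
    # Adam with the stated initial learning rate and L2 weight decay
    optimizer = Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    if fusion == "conv":
        # step scheduler: decay by 0.9 every five epochs (Bi-LSTM / conv fusion)
        scheduler = StepLR(optimizer, step_size=5, gamma=0.9)
    else:
        # cosine annealing over 34 epochs (SlowFast / linear fusion); warmup omitted
        scheduler = CosineAnnealingLR(optimizer, T_max=34)
    return optimizer, scheduler
```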

Table 3 shows the mF1 performance of each modality on the PETRAW dataset. The baselines, including video and kinematics, were compared with the visual kinematics-based index. In particular, the phase and step performance of the visual kinematics-based index was similar to the kinematics-based performance. This verifies that the visual kinematics-based index can help recognize the actions of tools, as shown in Tables 1, 2, and 3. Furthermore, the proposed fusion technique achieved improved performance compared to the baselines. Our fusion methodology is useful for fusing the representations by enhancing the interactions between features.

Performance on G40. We used the same training settings as for PETRAW, except that the initial learning rate was set to 1e-2, a weighted cross-entropy loss was used for SlowFast, and a cosine annealing scheduler was used for all experiments. A batch size of 64 was used in all experimental environments, and all networks were trained for 50 epochs. The sampling rate was set to 1 fps for the training and test datasets. The clip size was 32, and the time range T was equal to the clip size. Using the visual kinematics-based index also improved performance on G40, as shown in Table 4. That is, the visual kinematics-based index can replace kinematics in actual surgery.

Table 3. Performance change for each modality on PETRAW. {V, K, VKI} denote video, kinematics, and the visual kinematics-based index.
Table 4. Performance change for each modality on G40. mPrecision, mRecall, and mF1 are the class-averaged results.

4.3 Ablation Study

Visual Kinematics Based Index for Organs. The surgical procedure involves interactions between tools and organs; therefore, relation indices between tools and organs can help recognition performance. We evaluated the performance change when including a relation index between tools and organs, using \(\lambda _8\) and \(\lambda _{10}\) measured between tools and organs to capture this relationship. The comparison is shown in Table 5. The improved performance validates that these indices help recognize the surgical procedure.

Table 5. Comparative results when including organ indices on G40. We compare adding the relation index between tools and organs, including the liver, stomach, pancreas, spleen, and gallbladder.

Change of Semantic Model. We evaluated the change in performance with respect to the segmentation model. We considered three models: DeeplabV3+ [30], UperNet [27], and OCRNet [31]. UperNet used a Swin Transformer [28] backbone, and OCRNet used HRNet [32]. We used the default settings of MMSegmentation [33] to train the models for 100 and 300 epochs on PETRAW and G40, respectively. More accurate segmentation results led to improved recognition performance, as shown in Tables 6 and 7.

Table 6. Performance change for various segmentation models on PETRAW. The values in table are mF1-score for each task.
Table 7. Performance change for various segmentation models on G40.

5 Conclusion

We proposed a visual modality-based feature fusion method for recognizing surgical procedures. We extracted a visual kinematics-based index from a visual modality such as a semantic segmentation map and trained the model using these indices together with the visual features from a CNN. We validated that our approach helps recognize the surgical procedure in a simple simulation (PETRAW) and in actual surgery (G40). In addition, the visual kinematics-based index is expected to be helpful in non-robotic surgery, such as laparoscopic surgery, since it is generated from a visual modality. In future work, we will consider extracting the visual kinematics-based index from other visual modalities, such as object detection models.