
1 Introduction

Developing automatic, accurate, and robust medical image segmentation methods has been one of the principal problems in medical imaging, as it is essential for computer-aided diagnosis and image-guided surgery systems. Segmentation of organs or lesions from a medical scan helps clinicians make an accurate diagnosis, plan the surgical procedure, and propose treatment strategies. Following the popularity of deep convolutional neural networks (ConvNets) in computer vision, ConvNets were quickly adopted for medical image segmentation. Networks like U-Net [15], V-Net [13], 3D U-Net [3], Res-UNet [25], Dense-UNet [11], Y-Net [12], U-Net++ [28], KiU-Net [19, 20] and U-Net3+ [7] have been proposed specifically for performing image and volumetric segmentation for various medical imaging modalities. These methods achieve impressive performance on many difficult datasets, proving the effectiveness of ConvNets in learning discriminative features to segment the organ or lesion from a medical scan.

ConvNets are currently the basic building blocks of most methods proposed for image segmentation. However, they lack the ability to model long-range dependencies present in an image. More precisely, in ConvNets each convolutional kernel attends to only a local subset of pixels in the whole image, forcing the network to focus on local patterns rather than the global context. There have been works that model long-range dependencies for ConvNets using image pyramids [26], atrous convolutions [2] and attention mechanisms [8]. However, there is still scope for improvement in modeling long-range dependencies, as the majority of previous methods do not focus on this aspect for medical image segmentation tasks.

Fig. 1. (a) Input ultrasound image of an in vivo preterm neonatal brain ventricle. Predictions by (b) U-Net, (c) Res-UNet, (d) MedT, and (e) ground truth. The red box highlights a region that is misclassified by the ConvNet-based methods due to the lack of learned long-range dependencies. The ground truth here was segmented by an expert clinician: although the image shows some bleeding inside the ventricle area, the bleeding does not correspond to the segmented region. This information is correctly captured by the transformer-based model. (Color figure online)

To first understand why long-range dependencies matter for medical images, we visualize an example ultrasound scan of a preterm neonate and segmentation predictions of the brain ventricles from the scan in Fig. 1. For a network to provide an efficient segmentation, it should be able to understand which pixels correspond to the mask and which to the background. As the background of the image is scattered, learning long-range dependencies between the pixels corresponding to the background can help the network avoid misclassifying a background pixel as the mask, reducing false positives (considering 0 as background and 1 as the segmentation mask). Similarly, whenever the segmentation mask is large, learning long-range dependencies between the pixels corresponding to the mask is also helpful in making efficient predictions. In Fig. 1(b) and (c), we can see that the convolutional networks misclassify the background as a brain ventricle while the proposed transformer-based method does not make that mistake. This happens because our proposed method learns long-range dependencies between the pixel regions and the background.

In many natural language processing (NLP) applications, transformers [4] have been shown to encode long-range dependencies. This is due to the self-attention mechanism, which finds the dependencies within a given sequential input. Following their popularity in NLP applications, transformers have very recently been adopted for computer vision applications [5, 18]. With regard to transformers for segmentation tasks, Axial-Deeplab [22] utilized the axial attention module [6], which factorizes 2D self-attention into two 1D self-attentions, and introduced a position-sensitive axial attention design for segmentation. In the Segmentation Transformer (SETR) [27], a transformer was used as the encoder, taking a sequence of image patches as input, and a ConvNet was used as the decoder, resulting in a powerful segmentation model. In medical image segmentation, transformer-based models have not been explored much. The closest works are the ones that use attention mechanisms to boost performance [14, 24]. However, the encoder and decoder of these networks still have convolutional layers as their main building blocks.

It has been observed that transformer-based models work well only when they are trained on large-scale datasets [5]. This becomes problematic when adopting transformers for medical imaging tasks, as the number of labeled images available for training in any medical dataset is relatively scarce, and the labeling process is expensive and requires expert knowledge. Specifically, training with fewer images makes it difficult to learn the positional encoding for the images. To this end, we propose a gated position-sensitive axial attention mechanism in which we introduce four gates that control the amount of information the positional embeddings supply to the keys, queries, and values. These gates are learnable parameters, which allows the proposed mechanism to be applied to a dataset of any size. Depending on the size of the dataset, these gates learn whether the number of images is sufficient to learn a proper positional embedding. Based on whether the information learned by the positional embedding is useful, the gate parameters either converge to 0 or to some higher value. Furthermore, we propose a Local-Global (LoGo) training strategy, where we use a shallow global branch and a deep local branch that operates on patches of the medical image. This strategy improves segmentation performance as we not only operate on the entire image but also focus on the finer details present in the local patches. Finally, we propose the Medical Transformer (MedT), which uses our gated position-sensitive axial attention as its building block and adopts our LoGo training strategy.

In summary, this paper (1) proposes a gated position-sensitive axial attention mechanism that works well even on smaller datasets, (2) introduces an effective Local-Global (LoGo) training methodology for transformers, (3) proposes the Medical Transformer (MedT), which is built upon the above two concepts proposed specifically for medical image segmentation, and (4) successfully improves performance for medical image segmentation tasks over convolutional networks and fully attention-based architectures on three different datasets.

Fig. 2. (a) The main architecture of MedT, which uses the LoGo strategy for training. (b) The gated axial transformer layer used in MedT. (c) The gated axial attention layer, which is the basic building block of both the height and width gated multi-head attention blocks found in the gated axial transformer layer.

2 Medical Transformer (MedT)

2.1 Self-attention Overview

Let us consider an input feature map \(x \in \mathbb {R}^{C_{in} \times H \times W}\) with height H, width W and channels \(C_{in}\). The output \(y \in \mathbb {R}^{C_{out} \times H \times W}\) of a self-attention layer is computed from projections of the input using the following equation:

$$\begin{aligned} y_{ij} \ = \ \sum _{h=1}^{H} \sum _{w=1}^{W} {\text {softmax}} \left( q_{ij}^{T} k_{hw}\right) v_{hw}, \end{aligned}$$
(1)

where queries \(q=W_Q x\), keys \(k=W_K x\) and values \(v=W_V x\) are all projections computed from the input x. Here, \(q_{ij}, k_{ij}, v_{ij}\) denote the query, key and value at an arbitrary location \(i \in \{1, \dots , H\}\) and \(j \in \{1, \dots , W\}\), respectively. The projection matrices \(W_Q, W_K, W_V \in \mathbb {R}^{C_{in} \times C_{out}}\) are learnable. As shown in Eq. 1, the values v are pooled based on global affinities calculated using \({\text {softmax}} (q^T k)\). Hence, unlike convolutions, the self-attention mechanism is able to capture non-local information from the entire feature map. However, computing such affinities is computationally very expensive, and with increased feature map size it often becomes infeasible to use self-attention in vision model architectures. Moreover, unlike a convolutional layer, a self-attention layer does not utilize any positional information while computing the non-local context. Positional information is often useful in vision models to capture the structure of an object.
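For illustration, a minimal sketch of the global self-attention of Eq. 1 is shown below as a hypothetical single-head PyTorch module (not the authors' implementation); 1x1 convolutions play the role of the projection matrices \(W_Q, W_K, W_V\).

```python
# A minimal sketch of the global 2D self-attention in Eq. 1 (hypothetical
# single-head implementation for illustration; no positional information).
import torch
import torch.nn as nn


class GlobalSelfAttention2d(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        # 1x1 convolutions act as the projection matrices W_Q, W_K, W_V.
        self.to_q = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.to_k = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.to_v = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # Project and flatten the spatial dimensions: (B, C_out, H*W).
        q = self.to_q(x).flatten(2)
        k = self.to_k(x).flatten(2)
        v = self.to_v(x).flatten(2)
        # Affinities q^T k over all H*W positions: (B, H*W, H*W).
        attn = torch.softmax(torch.einsum("bci,bcj->bij", q, k), dim=-1)
        # Pool the values with the affinities and restore the spatial layout.
        y = torch.einsum("bij,bcj->bci", attn, v)
        return y.reshape(b, -1, h, w)


# The (H*W) x (H*W) affinity matrix is what makes this quadratic in the
# number of pixels, and hence expensive for large feature maps.
x = torch.randn(1, 16, 32, 32)
print(GlobalSelfAttention2d(16, 16)(x).shape)  # torch.Size([1, 16, 32, 32])
```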

Axial-Attention. To overcome the computational complexity of calculating the affinities, self-attention is decomposed into two self-attention modules. The first module performs self-attention along the feature map height axis and the second one operates along the width axis. This is referred to as axial attention [6]. Axial attention applied consecutively along the height and width axes effectively models the original self-attention mechanism with much better computational efficiency. To add positional bias while computing affinities through the self-attention mechanism, a position bias term is added to make the affinities sensitive to positional information [16]. This bias term is often referred to as the relative positional encoding. These positional encodings are typically learned during training and have been shown to have the capacity to encode the spatial structure of the image. Wang et al. [22] combined the axial-attention mechanism and positional encodings to propose an attention-based model for image segmentation. Additionally, unlike previous attention models which utilize relative positional encodings only for queries, Wang et al. [22] proposed to use them for all of the queries, keys and values. This additional position bias in the query, key and value is shown to capture long-range interactions with precise positional information [22]. For any given input feature map x, the updated self-attention mechanism with positional encodings along the width axis can be written as:

$$\begin{aligned} y_{ij} \ = \ \sum _{w=1}^{W} {\text {softmax}} \left( q_{ij}^{T} k_{iw} + q_{ij}^{T} r^q_{iw} + k_{iw}^{T} r^k_{iw} \right) (v_{iw} + r^v_{iw}), \end{aligned}$$
(2)

where the formulation in Eq. 2 follows the attention model proposed in [22] and \(r^q, r^k, r^v \in \mathbb {R}^{W \times W}\) for the width-wise axial attention model. Note that Eq. 2 describes the axial attention applied along the width axis of the tensor. A similar formulation is also used to apply axial attention along the height axis and together they form a single self-attention model that is computationally efficient.
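The following is a hedged sketch of the width-axis axial attention of Eq. 2, again as hypothetical PyTorch code rather than the reference implementation; it is single-head and, as a simplification, gives the positional encodings a channel dimension instead of the scalar \(W \times W\) tables used in the notation above.

```python
# A minimal sketch of width-axis axial attention with relative positional
# encodings r^q, r^k, r^v (Eq. 2). Hypothetical, simplified code.
import torch
import torch.nn as nn


class AxialAttentionWidth(nn.Module):
    def __init__(self, c_in: int, c_out: int, width: int):
        super().__init__()
        self.to_q = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.to_k = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.to_v = nn.Conv2d(c_in, c_out, 1, bias=False)
        # Learned positional encodings indexed by pairs of width positions;
        # a channel dimension is added here as a simplification.
        self.r_q = nn.Parameter(torch.randn(c_out, width, width) * 0.02)
        self.r_k = nn.Parameter(torch.randn(c_out, width, width) * 0.02)
        self.r_v = nn.Parameter(torch.randn(c_out, width, width) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Content term q^T k restricted to positions in the same row.
        logits = torch.einsum("bcij,bciw->bijw", q, k)
        # Positional terms q^T r^q and k^T r^k from Eq. 2.
        logits = logits + torch.einsum("bcij,cjw->bijw", q, self.r_q)
        logits = logits + torch.einsum("bciw,cjw->bijw", k, self.r_k)
        attn = torch.softmax(logits, dim=-1)
        # Aggregate the values together with their positional encodings r^v.
        y = torch.einsum("bijw,bciw->bcij", attn, v)
        y = y + torch.einsum("bijw,cjw->bcij", attn, self.r_v)
        return y


x = torch.randn(1, 16, 32, 32)
print(AxialAttentionWidth(16, 16, width=32)(x).shape)  # [1, 16, 32, 32]
```

The attention is computed only across the W positions of each row, so the cost grows linearly in W per position rather than quadratically in H*W; the height-axis module is symmetric.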

2.2 Gated Axial-Attention

We discussed the benefits of using the axial-attention mechanism proposed in [22] for visual recognition. Specifically, the axial attention proposed in [22] computes non-local context with good computational efficiency, encodes positional bias into the mechanism, and enables the encoding of long-range interactions within an input feature map. However, their model is evaluated on large-scale segmentation datasets, where it is easier for the axial attention to learn the positional bias at the key, query and value. We argue that for experiments with small-scale datasets, which is often the case in medical image segmentation, the positional bias is difficult to learn and hence will not always be accurate in encoding long-range interactions. In cases where the learned relative positional encodings are not accurate enough, adding them to the respective key, query and value tensors would reduce performance. Hence, we propose a modified axial-attention block that can control the influence the positional bias exerts on the encoding of non-local context. With the proposed modification, the self-attention mechanism applied along the width axis can be formally written as:

$$\begin{aligned} y_{ij} \ = \ \sum _{w=1}^{W} {\text {softmax}} \left( q_{ij}^{T} k_{iw} + G_Q q_{ij}^{T} r^q_{iw} + G_K k_{iw}^{T} r^k_{iw}\right) ( G_{V1} v_{iw} + G_{V2} r^v_{iw}), \end{aligned}$$
(3)

where the self-attention formula closely follows Eq. 2 with an added gating mechanism. Here, \(G_Q, G_K, G_{V1}, G_{V2} \in \mathbb {R}\) are learnable parameters which together form a gating mechanism that controls the influence the learned relative positional encodings have on encoding non-local context. Typically, if a relative positional encoding is learned accurately, the gating mechanism assigns it a high weight compared to the encodings which are not learned accurately. Figure 2(c) illustrates the feed-forward pass of a typical gated axial attention layer.
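A corresponding sketch of the gated variant in Eq. 3 is shown below; it differs from the previous sketch only in the four learnable scalar gates \(G_Q, G_K, G_{V1}, G_{V2}\). The initial gate values are an assumption for illustration, not the values used in the paper.

```python
# A hypothetical sketch of gated axial attention along the width axis (Eq. 3),
# not the released implementation.
import torch
import torch.nn as nn


class GatedAxialAttentionWidth(nn.Module):
    def __init__(self, c_in: int, c_out: int, width: int):
        super().__init__()
        self.to_q = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.to_k = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.to_v = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.r_q = nn.Parameter(torch.randn(c_out, width, width) * 0.02)
        self.r_k = nn.Parameter(torch.randn(c_out, width, width) * 0.02)
        self.r_v = nn.Parameter(torch.randn(c_out, width, width) * 0.02)
        # Learnable scalar gates controlling how much the positional
        # encodings contribute to the logits and the output (initial values
        # are illustrative assumptions).
        self.g_q = nn.Parameter(torch.tensor(0.1))
        self.g_k = nn.Parameter(torch.tensor(0.1))
        self.g_v1 = nn.Parameter(torch.tensor(1.0))
        self.g_v2 = nn.Parameter(torch.tensor(0.1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Content affinities plus gated positional terms (Eq. 3).
        logits = torch.einsum("bcij,bciw->bijw", q, k)
        logits = logits + self.g_q * torch.einsum("bcij,cjw->bijw", q, self.r_q)
        logits = logits + self.g_k * torch.einsum("bciw,cjw->bijw", k, self.r_k)
        attn = torch.softmax(logits, dim=-1)
        # Gated aggregation of the values and the value positional encodings.
        y = self.g_v1 * torch.einsum("bijw,bciw->bcij", attn, v)
        y = y + self.g_v2 * torch.einsum("bijw,cjw->bcij", attn, self.r_v)
        return y
```

If the gates on the positional terms converge toward zero during training, the layer falls back to content-only axial attention, which is the intended behavior when the dataset is too small to learn reliable positional encodings.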

2.3 Local-Global Training

It is evident that a transformer operating on patches is faster, but patch-wise training alone is not sufficient for tasks like medical image segmentation, as it restricts the network from learning any information or dependencies between pixels in different patches. To improve the overall understanding of the image, we propose to use two branches in the network: a global branch which works on the original resolution of the image, and a local branch which operates on patches of the image. In the global branch, we reduce the number of gated axial transformer layers, as we observe that the first few blocks of the proposed transformer model are sufficient to model long-range dependencies. In the local branch, we create 16 patches of size \(I/4 \times I/4\), where I is the dimension of the original image. Each patch is fed forward through the local branch and the resulting feature maps are re-sampled based on their locations to reconstruct a full-resolution feature map, as sketched below. The output feature maps of the two branches are then added and passed through a \(1 \times 1\) convolution layer to produce the output segmentation mask. This strategy improves performance as the global branch focuses on high-level information while the local branch focuses on finer details. The proposed Medical Transformer (MedT) uses the gated axial attention layer as its basic building block and uses the LoGo strategy for training. It is illustrated in Fig. 2(a). More details on the architecture and an ablation study with regard to the architecture can be found in the supplementary file.
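A simplified sketch of the LoGo forward pass follows. The branch modules are placeholders (in MedT the global and local branches are gated axial transformer encoder-decoders), and the module names and the reassembly loop are illustrative assumptions.

```python
# A hypothetical sketch of the LoGo forward pass: a global branch on the full
# image plus a local branch on a 4 x 4 grid of patches, fused by addition and
# a 1x1 convolution.
import torch
import torch.nn as nn


class LoGoSegmenter(nn.Module):
    def __init__(self, global_branch: nn.Module, local_branch: nn.Module,
                 feat_channels: int, num_classes: int = 1, grid: int = 4):
        super().__init__()
        self.global_branch = global_branch   # shallow, full-resolution branch
        self.local_branch = local_branch     # deeper branch run on patches
        self.grid = grid                     # 4 x 4 = 16 patches of size I/4
        self.head = nn.Conv2d(feat_channels, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ph, pw = h // self.grid, w // self.grid
        # Global branch sees the whole image at its original resolution.
        g_feat = self.global_branch(x)
        # Local branch runs on each patch; outputs are placed back at the
        # patch locations to rebuild a full-resolution feature map.
        l_feat = torch.zeros_like(g_feat)
        for i in range(self.grid):
            for j in range(self.grid):
                patch = x[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                l_feat[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = \
                    self.local_branch(patch)
        # Add both feature maps and predict the mask with a 1x1 convolution.
        return self.head(g_feat + l_feat)


# Placeholder branches that keep the spatial size unchanged.
global_branch = nn.Conv2d(3, 8, 3, padding=1)
local_branch = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(8, 8, 3, padding=1))
model = LoGoSegmenter(global_branch, local_branch, feat_channels=8)
print(model(torch.randn(2, 3, 128, 128)).shape)  # torch.Size([2, 1, 128, 128])
```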

3 Experiments and Results

3.1 Dataset Details

We use Brain anatomy segmentation (ultrasound) [21, 23], Gland segmentation (microscopic) [17] and MoNuSeg (microscopic) [9, 10] datasets for evaluating our method. More details about the datasets can be found in the supplementary.

3.2 Implementation Details

We use the binary cross-entropy (CE) loss between the prediction and the ground truth to train our network, which can be written as:

$$\begin{aligned} \mathcal {L}_{CE}(p,\hat{p}) = - \frac{1}{wh} \sum _{x=0}^{w-1}\sum _{y=0}^{h-1} \left( p(x,y) \log (\hat{p}(x,y)) + (1-p(x,y))\log (1-\hat{p}(x,y))\right) \end{aligned}$$

where w and h are the dimensions of the image, p(x, y) denotes the ground-truth label of the pixel at location (x, y), and \(\hat{p}(x,y)\) denotes the output prediction at that location. The training details are provided in the supplementary document.
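A small hedged sketch of this loss is shown below, assuming the network outputs are already sigmoid probabilities; in practice it matches PyTorch's built-in binary cross-entropy.

```python
# Pixel-wise binary cross-entropy as in the equation above (illustrative).
import torch


def bce_loss(p: torch.Tensor, p_hat: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # p, p_hat: tensors of shape (B, 1, H, W); the mean over all pixels
    # realizes the 1/(wh) normalization of the equation.
    p_hat = p_hat.clamp(eps, 1.0 - eps)  # numerical stability
    return -(p * torch.log(p_hat) + (1.0 - p) * torch.log(1.0 - p_hat)).mean()


target = (torch.rand(2, 1, 64, 64) > 0.5).float()
pred = torch.rand(2, 1, 64, 64)
print(bce_loss(target, pred).item())
```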

For baseline comparisons, we run experiments on both convolutional and transformer-based methods. For convolutional baselines, we compare with the fully convolutional network (FCN) [1], U-Net [15], U-Net++ [28] and Res-UNet [25]. For the transformer-based baseline, we use an Axial-Attention U-Net with residual connections inspired by [22]. For our proposed method, we experiment with each individual contribution. In the gated axial attention network, we use the axial attention U-Net with all of its axial attention layers replaced by the proposed gated axial attention layers. In LoGo, we perform local-global training for the axial attention U-Net without using the gated axial attention layers. In MedT, we use gated axial attention as the basic building block of the global branch and axial attention without positional encoding for the local branch.

3.3 Results

Table 1. Quantitative comparison of the proposed methods with convolutional and transformer-based baselines in terms of F1 and IoU scores.

For quantitative analysis, we use F1 and IoU scores for comparison. The quantitative results are tabulated in Table 1. It can be noted that for datasets with relatively more images, like Brain US, the fully attention-based (transformer) baseline performs better than the convolutional baselines. For the GlaS and MoNuSeg datasets, the convolutional baselines perform better than the fully attention-based baseline, as it is difficult to train fully attention-based models with less data [5]. The proposed methods overcome this issue: gated axial attention and LoGo both individually perform better than the other methods. Our final architecture, MedT, performs better than gated axial attention, LoGo and all the previous methods. The improvements over the fully attention-based baseline are 0.92%, 4.76% and 2.72% for the Brain US, GlaS and MoNuSeg datasets, respectively. Improvements over the best convolutional baseline are 1.32%, 2.19% and 0.06%. All of these values are in terms of F1 scores. For the ablation study, we use the Brain US data for all our experiments. The results are tabulated in Table 2.

Table 2. Ablation study
Fig. 3. Qualitative results on sample test images from the Brain US, GlaS and MoNuSeg datasets. The red box highlights regions where MedT performs better than the methods in comparison by making better use of long-range dependencies. (Color figure online)

Furthermore, we visualize the predictions from U-Net [15], Res-UNet [25], Axial Attention U-Net [22] and our proposed method MedT in Fig. 3. It can be seen that the predictions of MedT capture long-range dependencies well. For example, in the second row of Fig. 3, we can observe that the small segmentation mask highlighted by the red box goes undetected by all the convolutional baselines. However, as the fully attention-based models encode long-range dependencies, they learn to segment this region well thanks to the encoded global context. In the first and fourth rows, the other methods make false predictions at the highlighted regions, as those pixels are in close proximity to the segmentation mask. As our method takes into account pixel-wise dependencies that are encoded with the gating mechanism, it learns those dependencies better than the axial attention U-Net. This makes our predictions more precise, as they do not misclassify pixels near the segmentation mask.

4 Conclusion

In this work, we explored the use of transformer-based architectures for medical image segmentation. Specifically, we proposed a gated axial attention layer which is used as the building block for multi-head attention models. We also proposed a LoGo training strategy that trains the network on both the full-resolution image and its patches. The global branch helps learn global context features by modeling long-range dependencies, whereas the local branch focuses on finer features by operating on patches. Using these, we proposed MedT (Medical Transformer), which has gated axial attention as the main building block of its encoder and uses the LoGo strategy for training. Unlike other transformer-based models, the proposed method does not require pre-training on large-scale datasets. Finally, we conducted extensive experiments on three datasets, where MedT achieves better performance than ConvNets and other related transformer-based architectures.