
1 Introduction

Automatic and accurate medical image segmentation, which consists of the automated delineation of anatomical structures and other regions of interest (ROIs), plays an integral role in computer-aided diagnosis (CAD) [4, 16]. As the flagship of deep learning, convolutional neural networks (CNNs) have underpinned numerous contributions to medical image segmentation tasks for many years [21, 24]. Among the diverse CNN variants, the widely acknowledged symmetric encoder-decoder architecture known as U-Net [24] has demonstrated eminent segmentation potential. Its contracting path consists of a series of convolutional and down-sampling layers that capture contextual semantic information. In the decoder, the coarse-grained deep features are up-sampled and combined, via lateral connections from the encoder, with fine-grained shallow feature maps to generate a precise segmentation map. Following this technical route, many U-Net variants such as U-Net++ [30] and Res-UNet [29] have emerged to improve segmentation performance. A paramount caveat of such architectures is their restricted receptive field, which prevents the model from capturing sufficient contextual information and causes segmentation to fail in complicated areas such as boundaries. To mitigate this problem, the notable DeepLab [5] work was introduced, triggering broad interest in the image segmentation field. The authors made contributions that experimentally proved to have substantial practical merit. First, they introduced a convolution operation with up-sampled filters called ‘Atrous Convolution’, which enlarges the field of view of the filters to absorb larger context without increasing the amount of computation or the number of parameters. Second, to incorporate smoothness terms enabling the network to capture fine details, they exploited a fully connected Conditional Random Field (CRF) to refine the segmentation results. Following this pioneering work, extended versions were proposed to deliver further performance gains. DeepLabv2 [6] tackled the challenge of objects existing at multiple scales by introducing the atrous spatial pyramid pooling (ASPP) module, which probes a feature map with multiple atrous convolutions at different sampling rates to obtain multi-scale representation information and thus segments objects robustly across scales. Afterward, DeepLabv3 [7] refined the ASPP design with atrous convolution to attain sharper object boundaries. Ultimately, Chen et al. [8] proposed DeepLabv3+, which extends DeepLabv3 by adding a simple yet effective decoder module and employs depth-wise separable convolution to increase computational efficiency. Despite all these efforts, the shortcomings of CNNs remain prominent: their inductive biases of locality and weight sharing [28] constrain their ability to learn long-range dependencies and spatial correlations, resulting in sub-optimal segmentation of complex structures. Recently, the Transformer architecture [26] has sparked broad interest in computer vision [11, 12] owing to its elegant design and its attention mechanism; it has proven capable of learning long-range features and effectively modeling global information.
The pioneering Vision Transformer (ViT) [11] was a major step toward adapting Transformers to vision tasks and achieved satisfactory results in image classification. It splits the input image into patches and treats them as the source of information for the Transformer module. Despite this feasible design, the drawbacks of the scheme are noticeable and profound [3]. First, Transformers impose a quadratic computational load, which is prohibitive for dense prediction on high-resolution images. Moreover, despite being a good design choice for capturing explicit global context and long-range relations, Transformers are weak in capturing low-level pixel information, which is indisputably crucial for accurate segmentation. To circumvent the high memory demand of Transformers, the Swin-Transformer [19] proposed a hierarchical ViT that computes self-attention locally within non-overlapping windows, achieving linear complexity as opposed to the quadratic complexity of ViT. Recently, faced with the dilemma between efficient CNNs and powerful ViTs, crossovers between the two areas have emerged, most of which model a U-Net-like architecture with Transformers; examples include Trans-UNet [4], Swin-UNet [3], and DS-TransUNet [17]. Inspired by the breakthrough performance of DeepLab models with attention mechanisms in segmentation tasks [2], in this paper we propose TransDeepLab, a DeepLab-like pure Transformer for medical image segmentation. Akin to the recently proposed Swin-UNet, which models a U-Net structure with Transformer modules, we aim to imitate the seminal DeepLab with the Swin-Transformer. The intuition behind this choice is that the Swin-Transformer can be deployed efficiently, curbing the heavy computational demand of ViT. Moreover, applying the Swin-Transformer module with multiple window sizes makes it a lightweight yet suitable design choice for multi-scale feature fusion, which is particularly critical in segmentation tasks. In particular, we substitute the ASPP module of the DeepLabv3+ model with the aforementioned hierarchical design. Altogether, the proposed TransDeepLab efficiently compensates for the design shortcomings of DeepLab while achieving a significant reduction in parameters compared to competing methods. We elaborate on the details of our proposal, pinpointing the scope and contributions of this paper, in Sect. 2. Our contributions are as follows: (1) By incorporating the advantages of the hierarchical Swin-Transformer into the encoder, decoder, and ASPP module of DeepLab, the proposed TransDeepLab can effectively capture long-range and multi-scale representations. (2) We propose a cross-contextual attention mechanism to adaptively fuse the multi-scale representation. (3) To the best of our knowledge, this work is the first attempt to combine the Swin-Transformer with the DeepLab architecture for medical image segmentation.

2 Proposed Method

We propose the TransDeepLab model (Fig. 1), a pure Transformer-based DeepLabv3+ architecture, for medical image segmentation. The network utilizes the strength of the Swin-Transformer block [19] to build a hierarchical representation. Following the original architecture of the DeepLab model, we utilize a series of Swin-Transformer blocks to encode the input image into a high-level representational space. More specifically, the encoder module splits the input medical image into non-overlapping patches of size \(4 \times 4\), resulting in \(4\ \times \ 4\ \times \ 3=48\) as the feature dimension of each patch (signified as C), and applies Swin-Transformer blocks to encode both local semantic and long-range contextual representations. To model the Atrous Spatial Pyramid Pooling (ASPP) module, a pyramid of Swin-Transformer blocks with varying window sizes is designed. The main idea of the Swin pyramid is to capture multi-scale information by exploiting different window sizes. The obtained multi-scale contextual representation is then fused and passed to the decoder module using a cross-contextual attention mechanism. The attention block applies two levels of attention (i.e., channel and spatial attention) to the tokens derived from each level of the pyramid to model the multi-scale interaction. Finally, in the decoding path, the extracted multi-scale features are first bilinearly upsampled and then concatenated with the low-level features from the encoder to refine the feature representation. The details of each component of the proposed network are elaborated in the subsequent sections.
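As a high-level illustration, the following minimal PyTorch sketch outlines how these four components could be composed; the class and argument names are hypothetical and do not correspond to the released implementation.

```python
import torch
from torch import nn

class TransDeepLabSketch(nn.Module):
    """Hypothetical skeleton of the TransDeepLab data flow (names are illustrative)."""
    def __init__(self, encoder, sspp, cross_attention, decoder):
        super().__init__()
        self.encoder = encoder                   # stacked Swin-Transformer blocks
        self.sspp = sspp                         # Swin Spatial Pyramid Pooling
        self.cross_attention = cross_attention   # cross-contextual attention fusion
        self.decoder = decoder                   # Swin blocks + patch expanding

    def forward(self, x):
        low_level, deep = self.encoder(x)        # shallow and deep token maps
        pyramid = self.sspp(deep)                # list of multi-scale token maps
        fused = self.cross_attention(pyramid)    # adaptively fused representation
        return self.decoder(fused, low_level)    # full-resolution segmentation map
```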

Fig. 1.

The architecture of TransDeepLab, which extends the encoder-decoder structure of DeepLabv3+. Both the encoder and the decoder are constructed from Swin-Transformer blocks.

2.1 Swin-Transformer Block

Typical vision Transformers compute self-attention over a global receptive field and therefore incur computational complexity that is quadratic in the number of tokens. To mitigate this, the Swin-Transformer was devised; its key design characteristic is shifting the window partition between consecutive self-attention layers, replacing the standard multi-head self-attention (MSA) module of a Transformer block with a module based on shifted windows. Thus, a Swin-Transformer block comprises a (shifted) window-based MSA module, LayerNorm (LN) layers, a two-layer MLP, and GELU nonlinearity. The window-based multi-head self-attention (W-MSA) module and the shifted window-based multi-head self-attention (SW-MSA) module are applied in consecutive Transformer blocks. With this shifted window partitioning scheme, consecutive Swin-Transformer blocks can be formulated as:

$$\begin{aligned} \hat{z}^{l}&=\text {W-MSA}\left( \text {LN}\left( z^{l-1}\right) \right) +z^{l-1} \nonumber \\ z^{l}&=\text {MLP}\left( \text {LN}\left( \hat{z}^{l}\right) \right) +\hat{z}^{l} \nonumber \\ \hat{z}^{l+1}&=\text {SW-MSA}\left( \text {LN}\left( z^{l}\right) \right) +z^{l} \nonumber \\ z^{l+1}&=\text {MLP}\left( \text {LN}\left( \hat{z}^{l+1}\right) \right) +\hat{z}^{l+1} , \end{aligned}$$
(1)

where \(\hat{z}^{l}\) and \(z^{l}\) denote the outputs of the (S)W-MSA module and the MLP module of the \(l^{t h}\) block, respectively. Following [13, 14], self-attention is computed as:

$$\begin{aligned} \text {Attention}(Q, K, V)=\text {SoftMax}\left( \frac{Q K^{T}}{\sqrt{d}}+B\right) V , \end{aligned}$$
(2)

where \(Q, K, V \in \mathbb {R}^{M^{2} \times d}\) are the query, key, and value matrices, d is the query/key dimension, \(M^{2}\) is the number of patches in a window, and B indicates the relative position bias matrix whose values are taken from \(\hat{B} \in \mathbb {R}^{(2 M-1) \times (2 M-1)}\).
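For concreteness, a simplified PyTorch sketch of such a block, implementing the structure of Eq. (1), is given below; it omits the relative position bias of Eq. (2) and the attention masking used for shifted windows, so it is an illustrative approximation rather than the reference implementation.

```python
import torch
from torch import nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows*B, ws*ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, H, W):
    # inverse of window_partition: (num_windows*B, ws*ws, C) -> (B, H, W, C)
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class SwinBlock(nn.Module):
    """Simplified Swin-Transformer block implementing Eq. (1); relative-position
    bias and attention masking for shifted windows are omitted for brevity."""
    def __init__(self, dim, num_heads, window_size=7, shift=0, mlp_ratio=4):
        super().__init__()
        self.ws, self.shift = window_size, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:                           # SW-MSA: cyclically shift the feature map
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)       # self-attention within each local window
        attn, _ = self.attn(win, win, win)
        x = window_reverse(attn, self.ws, H, W)
        if self.shift:                           # undo the cyclic shift
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        x = shortcut + x                         # z_hat^l = (S)W-MSA(LN(z^{l-1})) + z^{l-1}
        return x + self.mlp(self.norm2(x))       # z^l = MLP(LN(z_hat^l)) + z_hat^l
```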

2.2 Encoder

Inspired by the low computational burden of the Swin-Transformer block [19] (contrary to the quadratic computation of the Vision Transformer [11]) and its strength in modeling long-range contextual dependencies (unlike regular CNNs), we build our encoder from stacked Swin-Transformer blocks. Our TransDeepLab encoder first feeds the C-dimensional tokenized input with a resolution of \(\frac{H}{4} \times \frac{W}{4}\) into two successive Swin-Transformer blocks to produce a hierarchical representation while keeping the resolution unchanged. Then, it applies a series of stacked Swin-Transformer blocks to gradually reduce the spatial dimension of the feature map (similar to a CNN encoder) and increase the feature dimension. The resulting mid-level representation is then fed to the Swin Spatial Pyramid Pooling (SSPP) block to capture a multi-scale representation.
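A minimal sketch of this encoding pipeline is shown below, reusing the SwinBlock sketch from Sect. 2.1; the embedding dimension, stage depths, head counts, and shift size are illustrative assumptions rather than the exact configuration.

```python
import torch
from torch import nn
# SwinBlock refers to the simplified block sketched in Sect. 2.1.

class PatchEmbedding(nn.Module):
    """Splits the image into non-overlapping 4x4 patches and projects them to C channels."""
    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                         # x: (B, 3, H, W)
        return self.proj(x).permute(0, 2, 3, 1)   # (B, H/4, W/4, C)

class PatchMerging(nn.Module):
    """Halves the spatial resolution and doubles the channels, like CNN down-sampling."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                         # x: (B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

class SwinEncoderSketch(nn.Module):
    """Two Swin blocks at H/4 x W/4 (low-level features), then a merge-and-block
    stage producing the mid-level representation that is fed to SSPP."""
    def __init__(self, dim=96, heads=(3, 6)):
        super().__init__()
        self.embed = PatchEmbedding(embed_dim=dim)
        self.stage1 = nn.Sequential(SwinBlock(dim, heads[0], shift=0),
                                    SwinBlock(dim, heads[0], shift=3))
        self.merge = PatchMerging(dim)
        self.stage2 = nn.Sequential(SwinBlock(2 * dim, heads[1], shift=0),
                                    SwinBlock(2 * dim, heads[1], shift=3))

    def forward(self, x):
        low = self.stage1(self.embed(x))          # low-level features for the decoder skip
        mid = self.stage2(self.merge(low))        # mid-level features for SSPP
        return low, mid
```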

2.3 Swin Spatial Pyramid Pooling

The spatial resolution of the deep features extracted by the encoder module is considerably decreased due to the stacked Swin-Transformer blocks followed by patch merging layers (similar to the consecutive down-sampling operations in a CNN encoder). Thus, to compensate for the lost spatial detail and produce a multi-scale representation, the DeepLab model utilizes an ASPP module, which replaces the pooling operation with atrous convolutions [6]. Concretely, DeepLab forms a pyramid representation by applying parallel convolution operations with multiple atrous rates. To model such an operation in a pure Transformer fashion, we create a Swin Spatial Pyramid Pooling (SSPP) block with varying window sizes to capture a multi-scale representation. In our design, the smaller window sizes capture local information while the larger windows extract global information. The resulting multi-scale representation is then fed to a cross-contextual attention module, which fuses it into a generic representation in a non-linear manner.
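The following sketch illustrates one way to realize the SSPP block with the SwinBlock sketch from Sect. 2.1; the particular window sizes are illustrative assumptions.

```python
from torch import nn
# SwinBlock refers to the simplified block sketched in Sect. 2.1.

class SwinSpatialPyramidPooling(nn.Module):
    """Parallel Swin blocks with different window sizes: small windows capture
    local detail while larger windows gather more global context (sketch)."""
    def __init__(self, dim, num_heads, window_sizes=(2, 4, 7)):
        super().__init__()
        self.levels = nn.ModuleList(
            [SwinBlock(dim, num_heads, window_size=ws) for ws in window_sizes])

    def forward(self, x):                  # x: (B, H', W', C) mid-level features
        # One token map per pyramid level, flattened for cross-contextual fusion.
        return [level(x).flatten(1, 2) for level in self.levels]
```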

2.4 Cross-Contextual Attention

In the DeepLabv3+ model, the feature vectors resulting from each level of the pyramid are concatenated and fed to a depthwise separable convolution to perform the fusion operation. This operation performs the convolution for each channel separately and is thus unable to model the channel-wise dependency among pyramid levels. In our design, to model the multi-scale interaction and fuse the pyramid features, we propose a cross-attention module. To this end, we assume that each level of the pyramid (\(\textbf{z}_{m}^{P \times C}\), where P and C indicate the number of tokens and the embedding dimension, respectively) represents the object of interest at a different scale; thus, by concatenating all these features we create a multi-scale representation \(\textbf{z}_{all}^{P \times MC}=[\textbf{z}_{1} \Vert \textbf{z}_{2} \Vert \dots \Vert \textbf{z}_{M}]\), where \(\Vert \) denotes the concatenation operation. Next, to adaptively emphasize the contribution of each feature map and suppress the less discriminative features, we propose a scale attention module. Our attention module takes into account the global representation of each channel and applies an MLP layer to produce scaling coefficients (\(w_{scale}\)) that selectively scale the channel representation among pyramid levels:

$$\begin{aligned} w_{\text {scale}}=\sigma \left( \textbf{W}_{2}\, \delta \left( \textbf{W}_{1}\, \text {GAP}\left( z_{all}\right) \right) \right) , \qquad z_{all}^{\prime }=w_{\text {scale}} \cdot z_{all} \end{aligned}$$
(3)

where \(W_1\) and \(W_2\) indicate the learnable MLP parameters, \(\delta \) and \(\sigma \) denote the ReLU and Sigmoid activations, and GAP indicates global average pooling. In the second attention level, we learn scaling parameters to highlight the informative tokens, applying the same strategy:

$$\begin{aligned} w_{\text {tokens}}=\sigma \left( \textbf{W}_{3}\, \delta \left( \textbf{W}_{4}\, \text {GAP}\left( z_{all}^{\prime }\right) \right) \right) , \qquad z_{all}^{\prime \prime }=w_{\text {tokens}} \cdot z_{all}^{\prime } \end{aligned}$$
(4)
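A minimal PyTorch sketch of this two-level attention (Eqs. (3) and (4)) is given below; here `channels` corresponds to the concatenated dimension \(M \cdot C\), `num_tokens` to P, and the reduction ratio is an illustrative choice.

```python
import torch
from torch import nn

class CrossContextualAttention(nn.Module):
    """Sketch of Eqs. (3)-(4): channel (scale) attention followed by token
    attention over the concatenated pyramid features."""
    def __init__(self, channels, num_tokens, reduction=4):
        super().__init__()
        self.scale_mlp = nn.Sequential(               # W1, W2 in Eq. (3)
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.token_mlp = nn.Sequential(               # W4, W3 in Eq. (4)
            nn.Linear(num_tokens, num_tokens // reduction), nn.ReLU(),
            nn.Linear(num_tokens // reduction, num_tokens), nn.Sigmoid())

    def forward(self, pyramid):                       # list of M tensors, each (B, P, C)
        z_all = torch.cat(pyramid, dim=-1)            # (B, P, M*C)
        w_scale = self.scale_mlp(z_all.mean(dim=1))   # GAP over tokens -> (B, M*C)
        z_all = z_all * w_scale.unsqueeze(1)          # Eq. (3)
        w_tokens = self.token_mlp(z_all.mean(dim=2))  # GAP over channels -> (B, P)
        return z_all * w_tokens.unsqueeze(-1)         # Eq. (4) -> (B, P, M*C)
```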

2.5 Decoder

In the decoder, the fused features (\(z_{all}^{\prime \prime }\)) produced by the attention module are first passed through a Swin-Transformer block with a patch-expanding operation to be upsampled by a factor of 4, and are then concatenated with the low-level features. Concatenating the shallow and deep features in this way helps reduce the loss of spatial detail caused by the down-sampling layers. Finally, a series of cascaded Swin-Transformer blocks with patch-expanding operations is applied to reach the full resolution of \(H \times W\).
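Below is a hypothetical sketch of a patch-expanding layer in the style popularized by Swin-UNet, together with a commented decoder step showing the skip concatenation; the exact expansion factors and dimensions used in our network may differ.

```python
import torch
from torch import nn

class PatchExpanding(nn.Module):
    """Sketch of a patch-expanding layer (the inverse of patch merging): a linear
    projection followed by a pixel-shuffle-style rearrangement that multiplies the
    spatial resolution by `scale` and halves the channel dimension."""
    def __init__(self, dim, scale=2):
        super().__init__()
        self.scale = scale
        self.expand = nn.Linear(dim, dim * scale * scale // 2, bias=False)

    def forward(self, x):                               # x: (B, H, W, C)
        B, H, W, _ = x.shape
        x = self.expand(x)                              # (B, H, W, scale^2 * C/2)
        x = x.view(B, H, W, self.scale, self.scale, -1)
        x = x.permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H * self.scale, W * self.scale, -1)  # (B, sH, sW, C/2)

# Hypothetical decoder step: up-sample the deep fused features, concatenate with
# the low-level encoder features, and refine with Swin blocks (omitted here).
#   deep: (B, H, W, C), low: (B, 2H, 2W, C/2)
#   x = torch.cat([PatchExpanding(C)(deep), low], dim=-1)   # (B, 2H, 2W, C)
```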

3 Experiments

3.1 Datasets

Synapse Multi-organ Segmentation. This dataset includes 30 abdominal CT scans with 3779 axial contrast-enhanced clinical images in total. Each CT volume consists of 85–198 slices of \(512 \times 512\) pixels, with a voxel spatial resolution of ([0.54–0.54] \(\times \) [0.98–0.98] \(\times \) [2.5–5.0]) mm\(^{3}\). We follow [4] for data partitioning and reporting the quantitative results.

Skin Lesion Segmentation. Our analysis for skin lesion segmentation is based on the ISIC 2017 [10], ISIC 2018 [9], and \(\textrm{PH}^{2}\) [20] datasets. The ISIC datasets were collected by the International Skin Imaging Collaboration (ISIC) as large-scale datasets of dermoscopy images along with their corresponding ground truth annotations. Furthermore, we use the \(\textrm{PH}^{2}\) dataset and follow the experimental setting of [18] for splitting the data.

3.2 Implementation Details

Turning to implementation aspects, the proposed TransDeepLab is implemented with the PyTorch library and trained on a single Nvidia RTX 3090 GPU. We train all of our models using the SGD optimizer for 200 epochs with a batch size of 24. The softmax Dice loss and cross-entropy loss are employed as objective functions, and L2 regularization is also adopted. Rotation and flipping are used as data augmentation methods to diversify the training set and obtain an unbiased training strategy. An initial learning rate of 0.05 with an adaptive decay schedule is used to train the model. In addition, we initialize the Swin-Transformer modules with weights pre-trained on ImageNet. We adopt a task-specific set of evaluation metrics to ensure a fair comparison for each experiment. These metrics include: 1) Dice Similarity Score, 2) Hausdorff Distance, 3) Sensitivity and Specificity, and 4) Accuracy.
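As an illustration of the combined objective, a minimal sketch of a softmax Dice plus cross-entropy loss is given below; the equal weighting and the smoothing constant are assumptions for demonstration, not necessarily the exact values used in training.

```python
import torch
from torch import nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Sketch of a combined soft-Dice and cross-entropy segmentation objective."""
    def __init__(self, smooth=1e-5, ce_weight=0.5):
        super().__init__()
        self.smooth, self.ce_weight = smooth, ce_weight

    def forward(self, logits, target):            # logits: (B, K, H, W); target: (B, H, W) class ids
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1 - ((2 * inter + self.smooth) / (union + self.smooth)).mean()
        return self.ce_weight * ce + (1 - self.ce_weight) * dice
```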

3.3 Evaluation Results

In this section, we conduct experiments to evaluate our proposed model and compare it with SOTA methods on the two aforementioned medical image segmentation tasks. We assess TransDeepLab in two distinct ways: quantitative analysis and selected qualitative visualization results.

Results of Synapse Multi-organ Segmentation. Experiments on the Synapse multi-organ CT dataset (Table 1) exhibit the effectiveness and generalization potential of our method, which achieves the best performance with a Dice score of \(80.16 \%\,(\textrm{DSC} \uparrow )\) and a Hausdorff distance of \(21.25\,(\textrm{HD} \downarrow )\). In particular, we attain the best performance on Kidney (L) with \(84.08 \%\), Pancreas with \(61.19 \%\), and Stomach with \(78.40 \%\) Dice score. A sample of the segmentation results on the Synapse multi-organ dataset is presented in Fig. 2. The organ instances are all detected and classified correctly, with slight variations in the segmentation contours. Compared to the CNN-based DeepLab model, our approach produces better segmentation results. All in all, these results support our motivation of modeling both local and global contextual representations with a pure Transformer-based method, providing a significant performance boost in segmentation, where maintaining rich semantic information is crucial.

Table 1. Comparison results of the proposed method on the Synapse dataset.
Fig. 2.

Visualization result of the proposed method on the Synapse dataset.

Results of Skin Lesion Segmentation. The results are summarized in Table 2. Our TransDeepLab performs better than other competitors w.r.t. most of the evaluation metrics. We also show some samples of the skin lesion segmentation obtained by the suggested network in Fig. 3. It is evident from Fig. 3 that TransDeepLab exhibits higher boundary segmentation accuracy together with a performance boost in capturing the fine-grained details.

Table 2. Performance comparison of the proposed method against the SOTA approaches on skin lesion segmentation benchmarks.
Fig. 3.

Segmentation results of the proposed method on the skin lesion segmentation benchmarks.

Model Complexity. Last but not least, we analyze the number of trainable parameters of our proposal, as heavy deep networks trained on small medical image datasets are usually prone to overfitting. TransDeepLab is essentially a lightweight model with only 21.14M parameters. Compared with Swin-UNet [3], the original DeepLab model [6], and Trans-UNet [4], which have 27.17M, 54.70M, and 105M parameters respectively, our lightweight TransDeepLab shows great superiority in terms of model complexity while remaining dominant in terms of evaluation metrics.

3.4 Ablation Study

CNN vs Transformer Encoder. This ablation experiment explores the effect of replacing the Transformer encoder. In particular, we keep the same decoder and SSPP module as our baseline but replace the encoder with a CNN backbone (e.g., ResNet-50), denoted as CNN as Encoder in Table 3. Judging from the results in Table 3, a purely CNN-based encoder yields sub-optimal performance, indicating that the Transformer encoder indeed contributes to the segmentation quality of TransDeepLab.

Attention Strategy. Next, we compare strategies for fusing the multi-scale representation produced by the levels of the Swin pyramid. Concretely, we compare the proposed cross-attention module with a basic scale fusion method that concatenates the feature maps and applies a fully connected layer to fuse them (denoted as Basic Scale Fusion in Table 3). Judging from Table 3, the cross-attention module confirms our intuition of capturing the interaction among feature levels according to the informativeness of the tokens at different scales. Moreover, the sample segmentation results in Fig. 2 indicate that using the cross-contextual attention mechanism yields results closer to the ground truth. This visualization highlights the effect of a multi-scale Transformer module on long-range contextual dependency learning, leading to precise localization, especially in boundary areas, which remain a substantial challenge in image segmentation.

SSPP Influence. As discussed above, the SSPP module improves the representation ability of the model by probing features at multiple scales to attain multi-scale information. We investigate aggregating features from adjacent layers of the Swin-Transformer by assembling the SSPP module with one to four levels in our experiments. Comparing the results in Table 3, we deduce that a two-level SSPP module mostly leads to Dice score gains, as it helps handle scale variability in medical image segmentation, whereas a three-level SSPP module yields notable performance in terms of Hausdorff distance. However, for better efficiency, the resolution of the input image should be in accordance with the number of SSPP levels, meaning that increasing the number of SSPP levels should be accompanied by higher-resolution inputs. The results also corroborate that the Transformer incorporates global context information better than its CNN counterpart. While one might speculate that modeling an entire CNN-based network with Transformers would inflate model complexity, we mitigate this issue by exploiting the Swin-Transformer instead of a typical ViT.

Table 3. Ablation study on the impact of modifying modules inside the proposed method. We report our results using the Synapse dataset.

4 Conclusion

In this paper, we present TransDeepLab, a pure Transformer-based architecture for medical image segmentation. Specifically, we follow the encoder-decoder design of the DeepLabv3+ model and leverage the potential of Transformers by using the Swin-Transformer as the fundamental building block of the architecture. Evaluated on a variety of medical image segmentation tasks, TransDeepLab effectively builds long-range dependencies and outperforms other SOTA Vision Transformers in our experiments.