
1 Introduction

Automatic and accurate medical image segmentation, which consists of the automated delineation of anatomical structures and other regions of interest (ROIs), plays an integral role in computer-aided diagnosis (CAD) [4, 16]. As the flagship of deep learning, convolutional neural networks (CNNs) have underpinned numerous contributions to medical image segmentation tasks for many years [21, 24]. Among the diverse CNN variants, the widely acknowledged symmetric encoder-decoder architecture known as U-Net [24] has demonstrated eminent segmentation potential. Its contracting path consists of a series of convolutional and down-sampling layers that capture contextual semantic information. In the decoder, the coarse-grained deep features are up-sampled and combined, via lateral connections from the encoder, with fine-grained shallow feature maps to generate a precise segmentation map. Following this technical route, many U-Net variants such as U-Net++ [30] and Res-UNet [29] have emerged to improve segmentation performance. A paramount caveat of such architectures is their restricted receptive field, which prevents the model from capturing sufficient contextual information and causes segmentation to fail in complicated areas such as boundaries. To mitigate this problem, the notable DeepLab [5] work was introduced, triggering broad interest in the image segmentation field. The authors made contributions that experimentally proved to have substantial practical merit. First, they introduced a convolution operation with up-sampled filters called ‘Atrous Convolution’, which enlarges the field of view of the filters to absorb larger context without increasing the amount of computation or the number of parameters. Second, to incorporate smoothness terms enabling the network to capture fine details, they exploited a fully connected Conditional Random Field (CRF) to refine the segmentation results. Following this pioneering work, extended versions were proposed to deliver further performance gains. DeepLabv2 [6] tackled the challenge of objects existing at multiple scales by introducing the atrous spatial pyramid pooling (ASPP) module, which probes a feature map with multiple atrous convolutions at different sampling rates to obtain multi-scale representation information and thus segments objects robustly across scales. Afterward, DeepLabv3 [7] refined the ASPP design with atrous convolution to attain sharper object boundaries. Ultimately, Chen et al. [8] proposed DeepLabv3+, which extends DeepLabv3 by adding a simple yet effective decoder module and employs depth-wise separable convolution to increase computational efficiency. Despite all these efforts, the shortcomings of CNNs remain prominent: their inductive biases of locality and weight sharing [28] constrain their ability to learn long-range dependencies and spatial correlations, resulting in sub-optimal segmentation of complex structures. Recently, the Transformer architecture [26] has sparked broad interest in computer vision [11, 12] owing to its elegant design and its attention mechanism; it has proven capable of learning long-range features and effectively modeling global information.
The pioneering Vision Transformer (ViT) [11] was a major step toward adapting Transformers to vision tasks and achieved satisfactory results in image classification. It splits the input image into patches and treats them as the source of information for the Transformer module. Despite this feasible design, the drawbacks of the scheme are noticeable and profound [3]. First, Transformers impose a quadratic computational load, which is prohibitive for dense prediction on high-resolution images. Moreover, despite being a good design choice for capturing explicit global context and long-range relations, Transformers are weak in capturing low-level pixel information, which is indisputably crucial for accurate segmentation. To circumvent the high memory demand of Transformers, the Swin-Transformer [19] proposed a hierarchical ViT that computes self-attention locally within non-overlapping windows, achieving linear complexity as opposed to the quadratic complexity of ViT. Recently, faced with the dilemma between efficient CNNs and powerful ViTs, crossovers between the two areas have emerged, most of which model a U-Net-like architecture with Transformers; examples include Trans-UNet [4], Swin-UNet [3], and DS-TransUNet [17]. Inspired by the breakthrough performance of DeepLab models with attention mechanisms in segmentation tasks [2], in this paper we propose TransDeepLab, a DeepLab-like pure Transformer for medical image segmentation. Akin to the recently proposed Swin-UNet, which models a U-Net structure with Transformer modules, we aim to imitate the seminal DeepLab with the Swin-Transformer. The intuition behind this choice is that the Swin-Transformer can be deployed efficiently, curbing the heavy computational demand of ViT. Moreover, applying the Swin-Transformer module with multiple window sizes makes it a lightweight yet suitable design choice for multi-scale feature fusion, which is particularly critical in segmentation tasks. In particular, we substitute the ASPP module of the DeepLabv3+ model with the aforementioned hierarchical design. Altogether, the proposed TransDeepLab efficiently compensates for the design shortcomings of DeepLab while achieving a significant reduction in parameters compared to competing methods. We elaborate on the details of our proposal, pinpointing the scope and contributions of this paper, in Sect. 2. Our contributions are as follows: (1) By incorporating the advantages of the hierarchical Swin-Transformer into the encoder, decoder, and ASPP module of DeepLab, the proposed TransDeepLab can effectively capture long-range and multi-scale representations. (2) We propose a cross-contextual attention mechanism to adaptively fuse the multi-scale representation. (3) To the best of our knowledge, this work is the first attempt to combine the Swin-Transformer with the DeepLab architecture for medical image segmentation.

2 Proposed Method

We propose the TransDeepLab model (Fig. 1), a pure Transformer-based DeepLabv3+ architecture, for medical image segmentation. The network utilizes the strength of the Swin-Transformer block [19] to build a hierarchical representation. Following the original architecture of the DeepLab model, we utilize a series of Swin-Transformer blocks to encode the input image into a high-level representational space. More specifically, the encoder module splits the input medical image into non-overlapping patches of size \(4 \times 4\), resulting in \(4\ \times \ 4\ \times \ 3=48\) as the feature dimension of each patch (signified as C), and applies Swin-Transformer blocks to encode both local semantic and long-range contextual representations. To model the Atrous Spatial Pyramid Pooling (ASPP) module, a pyramid of Swin-Transformer blocks with varying window sizes is designed. The main idea of the Swin pyramid is to capture multi-scale information by exploiting different window sizes. The obtained multi-scale contextual representation is then fused and passed to the decoder module using a cross-contextual attention mechanism. The attention block applies two levels of attention (i.e., channel and spatial attention) to the tokens derived from each level of the pyramid to model the multi-scale interaction. Finally, in the decoding path, the extracted multi-scale features are first bilinearly upsampled and then concatenated with the low-level features from the encoder to refine the feature representation. The details of each component of the proposed network are elaborated in the subsequent sections.
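As a high-level illustration, the following minimal PyTorch sketch outlines how these four components could be composed; the class and argument names are hypothetical and do not correspond to the released implementation.

```python
import torch
from torch import nn

class TransDeepLabSketch(nn.Module):
    """Hypothetical skeleton of the TransDeepLab data flow (names are illustrative)."""
    def __init__(self, encoder, sspp, cross_attention, decoder):
        super().__init__()
        self.encoder = encoder                   # stacked Swin-Transformer blocks
        self.sspp = sspp                         # Swin Spatial Pyramid Pooling
        self.cross_attention = cross_attention   # cross-contextual attention fusion
        self.decoder = decoder                   # Swin blocks + patch expanding

    def forward(self, x):
        low_level, deep = self.encoder(x)        # shallow and deep token maps
        pyramid = self.sspp(deep)                # list of multi-scale token maps
        fused = self.cross_attention(pyramid)    # adaptively fused representation
        return self.decoder(fused, low_level)    # full-resolution segmentation map
```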

Fig. 1.

The architecture of TransDeepLab, which extends the encoder-decoder structure of DeepLabv3+. Both the encoder and the decoder are constructed from Swin-Transformer blocks.

2.1 Swin-Transformer Block

Typical vision Transformers compute self-attention over a global receptive field and therefore incur computational complexity that is quadratic in the number of tokens. To mitigate this, the Swin-Transformer was devised; its key design characteristic is shifting the window partition between consecutive self-attention layers, replacing the standard multi-head self-attention (MSA) module of a Transformer block with a module based on shifted windows. Thus, a Swin-Transformer block comprises a (shifted) window-based MSA module, LayerNorm (LN) layers, a two-layer MLP, and GELU nonlinearity. The window-based multi-head self-attention (W-MSA) module and the shifted window-based multi-head self-attention (SW-MSA) module are applied in consecutive Transformer blocks. With this shifted window partitioning scheme, consecutive Swin-Transformer blocks can be formulated as:

$$\begin{aligned} \hat{z}^{l}&=\text {W-MSA}\left( \text {LN}\left( z^{l-1}\right) \right) +z^{l-1} \nonumber \\ z^{l}&=\text {MLP}\left( \text {LN}\left( \hat{z}^{l}\right) \right) +\hat{z}^{l} \nonumber \\ \hat{z}^{l+1}&=\text {SW-MSA}\left( \text {LN}\left( z^{l}\right) \right) +z^{l} \nonumber \\ z^{l+1}&=\text {MLP}\left( \text {LN}\left( \hat{z}^{l+1}\right) \right) +\hat{z}^{l+1} , \end{aligned}$$
(1)

where \(\hat{z}^{l}\) and \(z^{l}\) denote the outputs of the (S)W-MSA module and the MLP module of the \(l^{t h}\) block, respectively. Following [13, 14], self-attention is computed as:

$$\begin{aligned} \text {Attention}(Q, K, V)=\text {SoftMax}\left( \frac{Q K^{T}}{\sqrt{d}}+B\right) V , \end{aligned}$$
(2)

where \(Q, K, V \in \mathbb {R}^{M^{2} \times d}\) are the query, key, and value matrices, d is the query/key dimension, \(M^{2}\) is the number of patches in a window, and B indicates the relative position bias matrix whose values are taken from \(\hat{B} \in \mathbb {R}^{(2 M-1) \times (2 M-1)}\).
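For concreteness, a simplified PyTorch sketch of such a block, implementing the structure of Eq. (1), is given below; it omits the relative position bias of Eq. (2) and the attention masking used for shifted windows, so it is an illustrative approximation rather than the reference implementation.

```python
import torch
from torch import nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows*B, ws*ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, H, W):
    # inverse of window_partition: (num_windows*B, ws*ws, C) -> (B, H, W, C)
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class SwinBlock(nn.Module):
    """Simplified Swin-Transformer block implementing Eq. (1); relative-position
    bias and attention masking for shifted windows are omitted for brevity."""
    def __init__(self, dim, num_heads, window_size=7, shift=0, mlp_ratio=4):
        super().__init__()
        self.ws, self.shift = window_size, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:                           # SW-MSA: cyclically shift the feature map
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)       # self-attention within each local window
        attn, _ = self.attn(win, win, win)
        x = window_reverse(attn, self.ws, H, W)
        if self.shift:                           # undo the cyclic shift
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        x = shortcut + x                         # z_hat^l = (S)W-MSA(LN(z^{l-1})) + z^{l-1}
        return x + self.mlp(self.norm2(x))       # z^l = MLP(LN(z_hat^l)) + z_hat^l
```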

2.2 Encoder

Inspired by the low computational burden of the Swin-Transformer block [19] (contrary to the quadratic computation of the Vision Transformer [11]) and its strength in modeling long-range contextual dependencies (unlike regular CNNs), we build our encoder from stacked Swin-Transformer blocks. Our TransDeepLab encoder first feeds the C-dimensional tokenized input with a resolution of \(\frac{H}{4} \times \frac{W}{4}\) into two successive Swin-Transformer blocks to produce a hierarchical representation while keeping the resolution unchanged. Then, it applies a series of stacked Swin-Transformer blocks to gradually reduce the spatial dimension of the feature map (similar to a CNN encoder) and increase the feature dimension. The resulting mid-level representation is then fed to the Swin Spatial Pyramid Pooling (SSPP) block to capture a multi-scale representation.
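A minimal sketch of this encoding pipeline is shown below, reusing the SwinBlock sketch from Sect. 2.1; the embedding dimension, stage depths, head counts, and shift size are illustrative assumptions rather than the exact configuration.

```python
import torch
from torch import nn
# SwinBlock refers to the simplified block sketched in Sect. 2.1.

class PatchEmbedding(nn.Module):
    """Splits the image into non-overlapping 4x4 patches and projects them to C channels."""
    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                         # x: (B, 3, H, W)
        return self.proj(x).permute(0, 2, 3, 1)   # (B, H/4, W/4, C)

class PatchMerging(nn.Module):
    """Halves the spatial resolution and doubles the channels, like CNN down-sampling."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                         # x: (B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

class SwinEncoderSketch(nn.Module):
    """Two Swin blocks at H/4 x W/4 (low-level features), then a merge-and-block
    stage producing the mid-level representation that is fed to SSPP."""
    def __init__(self, dim=96, heads=(3, 6)):
        super().__init__()
        self.embed = PatchEmbedding(embed_dim=dim)
        self.stage1 = nn.Sequential(SwinBlock(dim, heads[0], shift=0),
                                    SwinBlock(dim, heads[0], shift=3))
        self.merge = PatchMerging(dim)
        self.stage2 = nn.Sequential(SwinBlock(2 * dim, heads[1], shift=0),
                                    SwinBlock(2 * dim, heads[1], shift=3))

    def forward(self, x):
        low = self.stage1(self.embed(x))          # low-level features for the decoder skip
        mid = self.stage2(self.merge(low))        # mid-level features for SSPP
        return low, mid
```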

2.3 Swin Spatial Pyramid Pooling

The spatial resolution of the deep features extracted by the encoder module is considerably decreased due to the stacked Swin-Transformer blocks followed by patch merging layers (similar to the consecutive down-sampling operations in a CNN encoder). Thus, to compensate for the lost spatial detail and produce a multi-scale representation, the DeepLab model utilizes an ASPP module, which replaces the pooling operation with atrous convolutions [6]. Concretely, DeepLab forms a pyramid representation by applying parallel convolution operations with multiple atrous rates. To model such an operation in a pure Transformer fashion, we create a Swin Spatial Pyramid Pooling (SSPP) block with varying window sizes to capture a multi-scale representation. In our design, the smaller window sizes capture local information while the larger windows extract global information. The resulting multi-scale representation is then fed to a cross-contextual attention module, which fuses it into a generic representation in a non-linear manner.
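The following sketch illustrates one way to realize the SSPP block with the SwinBlock sketch from Sect. 2.1; the particular window sizes are illustrative assumptions.

```python
from torch import nn
# SwinBlock refers to the simplified block sketched in Sect. 2.1.

class SwinSpatialPyramidPooling(nn.Module):
    """Parallel Swin blocks with different window sizes: small windows capture
    local detail while larger windows gather more global context (sketch)."""
    def __init__(self, dim, num_heads, window_sizes=(2, 4, 7)):
        super().__init__()
        self.levels = nn.ModuleList(
            [SwinBlock(dim, num_heads, window_size=ws) for ws in window_sizes])

    def forward(self, x):                  # x: (B, H', W', C) mid-level features
        # One token map per pyramid level, flattened for cross-contextual fusion.
        return [level(x).flatten(1, 2) for level in self.levels]
```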

2.4 Cross-Contextual Attention

In the DeepLabv3+ model, the feature vectors resulting from each level of the pyramid are concatenated and fed to a depthwise separable convolution to perform the fusion operation. This operation performs the convolution for each channel separately and is thus unable to model the channel-wise dependency among pyramid levels. In our design, to model the multi-scale interaction and fuse the pyramid features, we propose a cross-attention module. To this end, we assume that each level of the pyramid (\(\textbf{z}_{m}^{P \times C}\), where P and C indicate the number of tokens and the embedding dimension, respectively) represents the object of interest at a different scale; thus, by concatenating all these features we create a multi-scale representation \(\textbf{z}_{all}^{P \times MC}=[\textbf{z}_{1} \Vert \textbf{z}_{2} \Vert \dots \Vert \textbf{z}_{M}]\), where \(\Vert \) denotes the concatenation operation. Next, to adaptively emphasize the contribution of each feature map and suppress the less discriminative features, we propose a scale attention module. Our attention module takes into account the global representation of each channel and applies an MLP layer to produce scaling coefficients (\(w_{scale}\)) that selectively scale the channel representation among pyramid levels:

$$\begin{aligned} w_{\text {scale}}=\sigma \left( \textbf{W}_{2}\, \delta \left( \textbf{W}_{1}\, \text {GAP}\left( z_{all}\right) \right) \right) , \qquad z_{all}^{\prime }=w_{\text {scale}} \cdot z_{all} \end{aligned}$$
(3)

where \(W_1\) and \(W_2\) indicate the learnable MLP parameters, \(\delta \) and \(\sigma \) denote the ReLU and Sigmoid activations, and GAP indicates global average pooling. In the second attention level, we learn scaling parameters to highlight the informative tokens, applying the same strategy:

$$\begin{aligned} w_{\text {tokens}}=\sigma \left( \textbf{W}_{3}\, \delta \left( \textbf{W}_{4}\, \text {GAP}\left( z_{all}^{\prime }\right) \right) \right) , \qquad z_{all}^{\prime \prime }=w_{\text {tokens}} \cdot z_{all}^{\prime } \end{aligned}$$
(4)
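A minimal PyTorch sketch of this two-level attention (Eqs. (3) and (4)) is given below; here `channels` corresponds to the concatenated dimension \(M \cdot C\), `num_tokens` to P, and the reduction ratio is an illustrative choice.

```python
import torch
from torch import nn

class CrossContextualAttention(nn.Module):
    """Sketch of Eqs. (3)-(4): channel (scale) attention followed by token
    attention over the concatenated pyramid features."""
    def __init__(self, channels, num_tokens, reduction=4):
        super().__init__()
        self.scale_mlp = nn.Sequential(               # W1, W2 in Eq. (3)
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.token_mlp = nn.Sequential(               # W4, W3 in Eq. (4)
            nn.Linear(num_tokens, num_tokens // reduction), nn.ReLU(),
            nn.Linear(num_tokens // reduction, num_tokens), nn.Sigmoid())

    def forward(self, pyramid):                       # list of M tensors, each (B, P, C)
        z_all = torch.cat(pyramid, dim=-1)            # (B, P, M*C)
        w_scale = self.scale_mlp(z_all.mean(dim=1))   # GAP over tokens -> (B, M*C)
        z_all = z_all * w_scale.unsqueeze(1)          # Eq. (3)
        w_tokens = self.token_mlp(z_all.mean(dim=2))  # GAP over channels -> (B, P)
        return z_all * w_tokens.unsqueeze(-1)         # Eq. (4) -> (B, P, M*C)
```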

2.5 Decoder

In the decoder, the fused features (\(z_{all}^{\prime \prime }\)) produced by the attention module are first passed through a Swin-Transformer block with a patch-expanding operation to be upsampled by a factor of 4, and are then concatenated with the low-level features. Concatenating the shallow and deep features in this way helps reduce the loss of spatial detail caused by the down-sampling layers. Finally, a series of cascaded Swin-Transformer blocks with patch-expanding operations is applied to reach the full resolution of \(H \times W\).
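Below is a hypothetical sketch of a patch-expanding layer in the style popularized by Swin-UNet, together with a commented decoder step showing the skip concatenation; the exact expansion factors and dimensions used in our network may differ.

```python
import torch
from torch import nn

class PatchExpanding(nn.Module):
    """Sketch of a patch-expanding layer (the inverse of patch merging): a linear
    projection followed by a pixel-shuffle-style rearrangement that multiplies the
    spatial resolution by `scale` and halves the channel dimension."""
    def __init__(self, dim, scale=2):
        super().__init__()
        self.scale = scale
        self.expand = nn.Linear(dim, dim * scale * scale // 2, bias=False)

    def forward(self, x):                               # x: (B, H, W, C)
        B, H, W, _ = x.shape
        x = self.expand(x)                              # (B, H, W, scale^2 * C/2)
        x = x.view(B, H, W, self.scale, self.scale, -1)
        x = x.permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H * self.scale, W * self.scale, -1)  # (B, sH, sW, C/2)

# Hypothetical decoder step: up-sample the deep fused features, concatenate with
# the low-level encoder features, and refine with Swin blocks (omitted here).
#   deep: (B, H, W, C), low: (B, 2H, 2W, C/2)
#   x = torch.cat([PatchExpanding(C)(deep), low], dim=-1)   # (B, 2H, 2W, C)
```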

3 Experiments

3.1 Datasets

Synapse Multi-organ Segmentation. This dataset includes 30 abdominal CT scans with 3779 axial contrast-enhanced clinical images in total. Each CT volume consists of 85–198 slices of \(512 \times 512\) pixels, with a voxel spatial resolution of ([0.54–0.54] \(\times \) [0.98–0.98] \(\times \) [2.5–5.0]) mm\(^{3}\). We follow [4] for data partitioning and reporting the quantitative results.

Skin Lesion Segmentation. Our analysis for skin lesion segmentation is based on the ISIC 2017 [10], ISIC 2018 [9], and \(\textrm{PH}^{2}\) [20] datasets. The ISIC datasets were collected by the International Skin Imaging Collaboration (ISIC) as large-scale datasets of dermoscopy images along with their corresponding ground truth annotations. Furthermore, we use the \(\textrm{PH}^{2}\) dataset and follow the experimental setting of [18] for splitting the data.

3.2 Implementation Details

Turning to implementation aspects, the proposed TransDeepLab is implemented with the PyTorch library and trained on a single Nvidia RTX 3090 GPU. We train all of our models using the SGD optimizer for 200 epochs with a batch size of 24. The softmax Dice loss and cross-entropy loss are employed as objective functions, and L2 regularization is also adopted. Rotation and flipping are used as data augmentation methods to diversify the training set and obtain an unbiased training strategy. An initial learning rate of 0.05 with an adaptive decay schedule is used to train the model. In addition, we initialize the Swin-Transformer modules with weights pre-trained on ImageNet. We adopt a task-specific set of evaluation metrics to ensure a fair comparison for each experiment. These metrics include: 1) Dice Similarity Score, 2) Hausdorff Distance, 3) Sensitivity and Specificity, and 4) Accuracy.
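As an illustration of the combined objective, a minimal sketch of a softmax Dice plus cross-entropy loss is given below; the equal weighting and the smoothing constant are assumptions for demonstration, not necessarily the exact values used in training.

```python
import torch
from torch import nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Sketch of a combined soft-Dice and cross-entropy segmentation objective."""
    def __init__(self, smooth=1e-5, ce_weight=0.5):
        super().__init__()
        self.smooth, self.ce_weight = smooth, ce_weight

    def forward(self, logits, target):            # logits: (B, K, H, W); target: (B, H, W) class ids
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1 - ((2 * inter + self.smooth) / (union + self.smooth)).mean()
        return self.ce_weight * ce + (1 - self.ce_weight) * dice
```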

3.3 Evaluation Results

In this section, we conduct experiments to evaluate our proposed model and compare it with SOTA methods on the two aforementioned medical image segmentation tasks. We assess TransDeepLab in two distinct ways: quantitative analysis and selected qualitative visualization results.

Results of Synapse Multi-organ Segmentation. Experiments on the Synapse multi-organ CT dataset (Table 1) exhibit the effectiveness and generalization potential of our method, which achieves the best performance with a Dice score of \(80.16 \%\,(\textrm{DSC} \uparrow )\) and a Hausdorff distance of \(21.25\,(\textrm{HD} \downarrow )\). In particular, we attain the best performance on Kidney (L) with \(84.08 \%\), Pancreas with \(61.19 \%\), and Stomach with \(78.40 \%\) Dice score. A sample of the segmentation results on the Synapse multi-organ dataset is presented in Fig. 2. The organ instances are all detected and classified correctly, with slight variations in the segmentation contours. Compared to the CNN-based DeepLab model, our approach produces better segmentation results. All in all, these results support our motivation of modeling both local and global contextual representations with a pure Transformer-based method, providing a significant performance boost in segmentation, where maintaining rich semantic information is crucial.

Table 1. Comparison results of the proposed method on the Synapse dataset.
Fig. 2.

Visualization result of the proposed method on the Synapse dataset.

Results of Skin Lesion Segmentation. The results are summarized in Table 2. Our TransDeepLab performs better than other competitors w.r.t. most of the evaluation metrics. We also show some samples of the skin lesion segmentation obtained by the suggested network in Fig. 3. It is evident from Fig. 3 that TransDeepLab exhibits higher boundary segmentation accuracy together with a performance boost in capturing the fine-grained details.

Table 2. Performance comparison of the proposed method against the SOTA approaches on skin lesion segmentation benchmarks.
Fig. 3.

Segmentation results of the proposed method on the skin lesion segmentation benchmarks.

Model Complexity. Last but not least, we analyze the number of trainable parameters of our proposal, as heavy deep networks trained on small medical image datasets are usually prone to overfitting. TransDeepLab is essentially a lightweight model with only 21.14M parameters. Compared with Swin-UNet [3], the original DeepLab model [6], and Trans-UNet [4], which have 27.17M, 54.70M, and 105M parameters respectively, our lightweight TransDeepLab shows great superiority in terms of model complexity while remaining dominant in terms of evaluation metrics.

3.4 Ablation Study

CNN vs Transformer Encoder. This ablation experiment explores the effect of replacing the Transformer encoder. In particular, we keep the same decoder and SSPP module as our baseline but replace the encoder with a CNN backbone (e.g., ResNet-50), denoted as CNN as Encoder in Table 3. Judging from the results in Table 3, a purely CNN-based encoder yields sub-optimal performance, indicating that the Transformer encoder indeed contributes to the segmentation quality of TransDeepLab.

Attention Strategy. Next, we compare strategies for fusing the multi-scale representation produced by the levels of the Swin pyramid. Concretely, we compare the proposed cross-attention module with a basic scale fusion method that concatenates the feature maps and applies a fully connected layer to fuse them (denoted as Basic Scale Fusion in Table 3). Judging from Table 3, the cross-attention module confirms our intuition of capturing the interaction among feature levels according to the informativeness of the tokens at different scales. Moreover, the sample segmentation results in Fig. 2 indicate that using the cross-contextual attention mechanism yields results closer to the ground truth. This visualization highlights the effect of a multi-scale Transformer module on long-range contextual dependency learning, leading to precise localization, especially in boundary areas, which remain a substantial challenge in image segmentation.

SSPP Influence. As discussed above, the SSPP module improves the representation ability of the model by probing features at multiple scales to attain multi-scale information. We investigate aggregating features from adjacent layers of the Swin-Transformer by assembling the SSPP module with one to four levels in our experiments. Comparing the results in Table 3, we deduce that a two-level SSPP module mostly leads to Dice score gains, as it helps handle scale variability in medical image segmentation, whereas a three-level SSPP module yields notable performance in terms of Hausdorff distance. However, for better efficiency, the resolution of the input image should be in accordance with the number of SSPP levels, meaning that increasing the number of SSPP levels should be accompanied by higher-resolution inputs. The results also corroborate that the Transformer incorporates global context information better than its CNN counterpart. While one might speculate that modeling an entire CNN-based network with Transformers would inflate model complexity, we mitigate this issue by exploiting the Swin-Transformer instead of a typical ViT.

Table 3. Ablation study on the impact of modifying modules inside the proposed method. We report our results using the Synapse dataset.

4 Conclusion

In this paper, we present TransDeepLab, a pure Transformer-based architecture for medical image segmentation. Specifically, we follow the encoder-decoder design of the DeepLabv3+ model and leverage the potential of Transformers by using the Swin-Transformer as the fundamental building block of the architecture. Evaluated on a variety of medical image segmentation tasks, TransDeepLab effectively builds long-range dependencies and outperforms other SOTA Vision Transformers in our experiments.