1 Introduction

Medical image segmentation, one of the critical computer-aided medical image analysis problems, aims to precisely capture the shapes and volumes of target organs and tissues by pixel-wise classification, obtaining clinically useful information for diagnosis, treatment, and intervention. With the recent development of deep learning methods and computer vision algorithms, medical image segmentation has been revolutionized and remarkable progress has been achieved (e.g., automatic liver and tumor lesion segmentation [1], brain tumor segmentation [2], and multiple sclerosis (MS) lesion segmentation [3]).

The fully convolutional network (FCN) [4] was first proved effective for general image segmentation tasks and became a predominant technique for medical image segmentation [5,6,7,8,9]. However, it was observed that vital details could be lost as feature map sizes decreased when FCN models went deeper. To this end, a family of U-shaped networks [10,11,12,13,14,15,16,17] was proposed to extend the sequential FCN framework to encoder–decoder architectures, alleviating the spatial information loss with skip connections. In the DeepLab models [18,19,20,21], atrous convolutions were applied instead of pooling layers to expand the receptive field, and a fully connected conditional random field (CRF) was introduced to maintain fine details. Although these CNN-based methods have achieved great performance on medical image segmentation tasks, they still suffer from limited receptive fields and are unable to capture long-range dependencies, leading to sub-optimal accuracy and failing to meet the needs of various medical image segmentation scenarios.

Inspired by the success of the Transformer and its self-attention mechanism in natural language processing (NLP) tasks [22, 23], researchers have adapted Transformers [24,25,26,27] to computer vision (CV) in order to compensate for the locality of CNNs. The self-attention mechanism in Transformers computes pair-wise relations between patches globally, consequently achieving feature interactions across long ranges. Self-attention was first adopted by non-local neural networks [28] to complement CNNs in modeling pixel-level long-range dependencies for visual recognition tasks. Then, the Vision Transformer (ViT) [24] proposed a pure Transformer framework for vision tasks, treating an image as a collection of spatial patches. Recently, Transformers have achieved excellent outcomes on a variety of vision tasks [29,30,31,32,33,34], including image recognition [29,30,31, 35, 36], semantic segmentation [32], and object detection [33, 34]. For semantic medical image segmentation, Transformer-combined architectures can be divided into two categories: the main one adopts self-attention-like operations to complement CNNs [37,38,39,40], and the other uses pure Transformers to constitute encoder–decoder architectures that capture deep representations and predict the class of each image pixel [32, 41,42,43].

Although the above medical image segmentation methods were promising and yielded good performance to some extent, they still suffered from considerable drawbacks. (1) The majority of these Transformer segmentation models were designed for 2D images [37, 39, 41,42,43]. For 3D medical images (e.g., 3D MRI scans), they divided the input images into 2D slices and processed the individual slices with the 2D models, which could lose useful 3D contextual information. (2) Compared with common 2D natural scene images, processing 3D medical images inevitably incurs larger model sizes and computational costs, especially when computing global feature interactions with self-attention in the vanilla Transformer [22] (see more details in Sect. 3.3). Although some adaptations were proposed to reduce the operation scopes of self-attention [29,30,31, 44,45,46,47] (e.g., progressive scaling pyramids were used in the Pyramid Vision Transformer [30] to reduce the computation costs of large feature maps), this came at the cost of insufficient global information fusion. (3) The self-attention operation in Transformers is permutation-equivariant [22], ignoring the order of patches in an input sequence. This property can be detrimental to medical image segmentation since segmentation results are often highly position-correlated. In prior works, absolute position encoding (APE) [22] and relative position encoding (RPE) [29, 48] were utilized to supplement position information. However, APE requires a pre-specified and fixed number of patches and thus fails to generalize to different image sizes, while RPE ignores the absolute position information that can be a vital cue in medical images (e.g., the positions of bones are often relatively stable).

To address the above drawbacks, we propose a new efficient model called Dilated Transformer (D-Former) to directly process 3D medical images (instead of dealing with 2D slices of 3D images independently) and predict volumetric segmentation masks. Our proposed D-Former is a 3D U-shaped architecture with hierarchical layers, and employs skip connections from encoder to decoder following [10,11,12,13,14,15,16,17]. The model's stem is constructed with eight D-Former blocks, each of which consists of several local scope modules (LSMs) and global scope modules (GSMs). An LSM conducts self-attention locally, focusing on capturing fine information. A GSM performs self-attention on uniformly sampled patches across the feature map, aiming to explore coarse, long-range information at low cost. The LSMs and GSMs are arranged in an alternate manner to achieve local and global information interaction. For drawback (3), we incorporate position information among patches in a more dynamic manner: inspired by [44, 49], we utilize depth-wise convolutions [50] to learn position information, which can provide useful position cues for medical image segmentation.

Benefiting from these designs, our proposed D-Former model could be more suitable for medical image segmentation tasks and yield better segmentation accuracy. The main contributions of this work are as follows:

  1. We construct a 3D Transformer-based architecture which can process volumetric medical images as a whole, so that spatial information along the depth dimension of 3D medical images can be fully captured.

  2. We design local scope modules (LSMs) and global scope modules (GSMs) to enlarge the scopes of information interactions without increasing the number of patches involved in computing self-attention, which helps reduce computational costs.

  3. To further incorporate relative and absolute position information among patches, we apply a dynamic position encoding method that learns it from the input directly. As a result, an inherent problem of common Transformers, permutation equivariance [22], can be considerably alleviated.

  4. Extensive experimental evaluations show that our model outperforms state-of-the-art segmentation methods in different domains (e.g., CT and MRI), with smaller model sizes and fewer FLOPs than the known methods.

2 Related work

2.1 CNN-based segmentation networks

Since the advent of the seminal U-Net model [10], many CNN-based networks have been developed [17, 51,52,53,54]. As for the design of skip connections, U-Net++ [55] and U-Net3+ [56] were proposed to attain dense connections between encoder and decoder. In addition, to address the locality of CNNs, different kinds of mechanisms were designed to enlarge the receptive field, such as larger kernels [57], dilated convolution modules [58, 59], pyramid pooling modules [60, 61], and deformable convolution modules [62, 63]. In particular, dilated convolution is an ingenious design in which the convolution kernel is expanded by inserting holes between its consecutive elements. This design has been adopted by various segmentation models, achieving good performance compared with the original convolution-based methods. Our Dilated Transformer also draws a key idea from this design and conducts self-attention in a patch-skipping manner (see Sect. 3.3 for details).

2.2 Visual Transformer variants

The Transformer and its self-attention mechanism were first designed for sequence modeling and transduction tasks in the domain of natural language processing (NLP), achieving state-of-the-art performance [22, 23]. Inspired by this tremendous success, Transformers were adapted for computer vision tasks. The first attempt was the Vision Transformer (ViT) [24], which required huge pre-training datasets. To overcome this weakness, DeiT [25] proposed a range of training strategies with knowledge distillation, which contributed to better performance of the vanilla Transformer. Various adaptations of the vanilla Transformer followed, such as Swin Transformer [29], Pyramid Vision Transformer [30], Transformer in Transformer [64], and aggregating nested Transformers [31]. In particular, Swin Transformer showed great success in various computer vision tasks with its elegant shifted window mechanism and hierarchical architecture. Our proposed D-Former is inspired by Swin Transformer's combination of local and global scopes of information interactions.

2.3 Transformers for segmentation tasks

As mentioned above, Transformers used in medical image segmentation methods can be divided into two categories. In the main category, the Transformer and its self-attention mechanism are utilized as a supplement to a convolution-based stem. SETR [65] applied a Transformer as the encoder to extract features for segmentation tasks. For medical images, many Transformer-based models have focused on segmentation tasks. In TransUNet [37], convolutional layers were used as a feature extractor to obtain detailed information from raw images, and the generated feature maps were then fed into Transformer layers to capture global information. UNETR [38] proposed a 3D Transformer-combined architecture for medical images, which used Transformer layers as the encoder to extract features and convolutional layers as the decoder. A great deal of such work focused on taking advantage of both the Transformer's long-range dependency and the CNN's inductive bias. In the other category, the Transformer serves as the main stem for building the whole architecture [32, 41,42,43]. In MedT [66], the gated axial attention mechanism and Gated Axial Transformer layer were proposed to build the architecture. Swin-Unet [41] was constructed with the basic units of Swin Transformer blocks, and was further extended in DS-TransUNet [42] by adding another encoder pathway for inputs of different sizes. Compared with these previous methods, our proposed D-Former model has several advantages: (1) Our method focuses on 3D medical image segmentation, a topic with little previous exploration in the context of Transformers; (2) our D-Former avoids cumbersome designs for specifically fusing CNNs and Transformers, constructing the architecture stem based on Transformers only; and (3) by designing LSMs and GSMs (see Sect. 3.3), our model's complexity is significantly lower than that of the compared methods.

Fig. 1 Overall architecture of our D-Former model. Each D-Former block is constructed with one dynamic position encoding block (DPE) and several local scope modules (LSMs) and global scope modules (GSMs). The input size of D-Former block i is reported sideward, and the output sizes are the same as the corresponding input sizes. The values in round brackets denote the numbers of patches, which are regarded as one dimension when computed in Transformers (i.e., \((\frac{W}{4} \times \frac{H}{4} \times \frac{D}{2})\), \((\frac{W}{8} \times \frac{H}{8} \times \frac{D}{4})\), \((\frac{W}{16} \times \frac{H}{16} \times \frac{D}{8})\), \((\frac{W}{32} \times \frac{H}{32} \times \frac{D}{16})\))

3 Method

3.1 The overall architecture

Our proposed D-Former model is outlined in Fig. 1; it is a hierarchical encoder–decoder architecture. The encoder pathway consists of one patch embedding layer for transforming 3D images into sequences and four D-Former blocks for feature extraction, with three down-sampling layers in between. The first, second, and fourth D-Former blocks each consist of one local scope module (LSM) and one global scope module (GSM), while the third D-Former block has three LSMs and three GSMs; within each block, the LSMs and GSMs are arranged in an alternate manner. The decoder pathway is symmetric to the encoder pathway, with four D-Former blocks, three up-sampling layers, and one patch expanding layer. In addition, skip connections are used to transfer information from the encoder to the decoder at the corresponding levels. The feature maps from the encoder are concatenated with the corresponding decoder feature maps along the channel dimension, which compensates for the loss of fine-grained information as the model goes deeper.

In this section, we will present the components of D-Former one by one, including the patch embedding and patch expanding layers (Sect. 3.2), the D-Former block and its major modules, the local scope module and global scope module (Sect. 3.3), the down-sampling and up-sampling operations (Sect. 3.4), and the dynamic position encoding block (Sect. 3.5).

3.2 Patch embedding and patch expanding

Similar to common Transformers in computer vision, after data augmentation, an input 3D medical image \(x \in {\mathbb {R}}^{W \times H \times D}\) is first processed by a patch embedding layer: it is divided into a series of patches of size \(4\times 4\times 2\) each, which are then projected into C channel dimensions by a linear projection to yield a feature map (denoted by \(x_1\)) of size \((\frac{W}{4} \times \frac{H}{4} \times \frac{D}{2}) \times C\), where \((\frac{W}{4} \times \frac{H}{4} \times \frac{D}{2})\) is the number of patches and C is the number of channel dimensions. Hence, the input 3D image is reorganized as a sequence (of length \((\frac{W}{4} \times \frac{H}{4} \times \frac{D}{2})\)) and can be directly fed to a Transformer architecture. The final patch expanding layer restores the feature map to the original input size, and a segmentation head (as in 3D UNet [67]) is utilized to attain pixel-wise segmentation masks.
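For concreteness, the following is a minimal sketch of such a patch embedding layer, assuming the input volume is laid out as (batch, channel, D, H, W) and that the partition-plus-linear-projection step is realized with a strided 3D convolution whose kernel equals the patch size; the names and defaults (e.g., `PatchEmbedding3D`, `embed_dim=64`) are illustrative rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding3D(nn.Module):
    """Sketch of the patch embedding layer: split the volume into 2x4x4
    (D x H x W) patches and linearly project each patch to C channels via a
    strided Conv3d (an assumed but common realization)."""
    def __init__(self, in_channels=1, embed_dim=64, patch_size=(2, 4, 4)):
        super().__init__()
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (N, 1, D, H, W)
        x = self.proj(x)                     # (N, C, D/2, H/4, W/4)
        x = x.flatten(2).transpose(1, 2)     # (N, D/2 * H/4 * W/4, C) patch sequence
        return x

# Example: a 64 x 128 x 128 volume becomes a sequence of 32*32*32 = 32768 patches.
tokens = PatchEmbedding3D()(torch.randn(1, 1, 64, 128, 128))
print(tokens.shape)  # torch.Size([1, 32768, 64])
```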

3.3 D-Former blocks

After patch embedding, \(x_{1}\) is directly fed to D-Former block 1. Within D-Former block 1, \(x_1\) is first processed by a new dynamic position encoding block that embeds position information into the feature maps (see details in Sect. 3.5), and is then operated on by the local scope modules (LSMs) and global scope modules (GSMs) alternately to extract higher-level features. The other D-Former blocks process their corresponding input features similarly, and the feature map sizes are provided in Fig. 1.

Fig. 2 Local scope module (LSM) and global scope module (GSM), which are arranged in pairs to combine local and global information

3.3.1 Local scope module and global scope module

The local scope module (LSM) and global scope module (GSM) are designed to capture local and global features, respectively, for which two different self-attention operations are employed, called local scope multi-head self-attention (LS-MSA) and global scope multi-head self-attention (GS-MSA). As shown in Fig. 2, an LSM is composed of a LayerNorm layer [68], the proposed LS-MSA, another LayerNorm layer, and a multilayer perceptron (MLP), in sequence, with two residual connections to prevent gradient vanishing [22]. In a GSM, the LS-MSA is replaced by the proposed GS-MSA, and the other components are the same as in the LSM. To allow local and global features to be captured and fused well, LSMs and GSMs are arranged alternately in each D-Former block. With these components, the operations are formally defined as:

$$\begin{aligned} \hat{z}^{l} = \text{LS-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}, \end{aligned}$$
(1)
$$\begin{aligned} z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}, \end{aligned}$$
(2)
$$\begin{aligned} \hat{z}^{l+1} = \text{GS-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}, \end{aligned}$$
(3)
$$\begin{aligned} z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}, \end{aligned}$$
(4)

where \({\hat{z}}^{l}\) and \(z^{l}\) denote the outputs of LS-MSA and the corresponding MLP, respectively, and \({\hat{z}}^{l+1}\) and \(z^{l+1}\) denote the outputs of GS-MSA and the corresponding MLP, respectively.
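The following minimal sketch shows how Eqs. (1)–(4) translate into a pre-norm Transformer sub-block; the attention module is passed in as a placeholder standing for either LS-MSA or GS-MSA, and the MLP expansion ratio of 4 is an assumption for illustration rather than a detail stated here.

```python
import torch.nn as nn

class ScopeModule(nn.Module):
    """Sketch of one LSM/GSM following Eqs. (1)-(4): pre-LayerNorm attention
    and MLP sub-layers, each wrapped in a residual connection. `attn` is an
    illustrative placeholder for either LS-MSA or GS-MSA."""
    def __init__(self, dim, attn, mlp_ratio=4):
        super().__init__()
        self.norm1, self.attn = nn.LayerNorm(dim), attn
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):                 # z: (N, num_patches, dim)
        z = self.attn(self.norm1(z)) + z  # Eq. (1) / (3)
        z = self.mlp(self.norm2(z)) + z   # Eq. (2) / (4)
        return z

# Usage (hypothetical): block = ScopeModule(dim=64, attn=my_ls_msa_module)
```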

Fig. 3 a Local scope multi-head self-attention: Self-attention is conducted in a local unit (colored in blue) whose patches are adjacent. b Global scope multi-head self-attention: Self-attention is conducted in a global unit (colored in blue) whose patches are picked every gth patch across the feature map. A small cube represents one patch. As an example, the feature map size is set as \(6\times 6\times 6\) and the unit size as \(3\times 3\times 3\). Only the patches of one unit are colored in blue for illustration; the other gray patches are also utilized to construct the seven other units in both LS-MSA and GS-MSA

3.3.2 Local scope multi-head self-attention (LS-MSA)

Self-attention in the vanilla Transformer is conducted in a global scope in order to capture pair-wise relationships between patches, leading to quadratic complexity with respect to the number of patches. Since 3D medical images inevitably increase the amount of computation, this original self-attention is not suitable for 3D medical image tasks, especially semantic segmentation with dense prediction targets. Under such circumstances, as illustrated in Fig. 3a, a whole feature map is first divided evenly into non-overlapping units (the number of patches in each unit is denoted by \(u_{d} \times u_{h} \times u_{w}\), where \(u_d\) denotes the number of patches in one unit along the depth dimension D, \(u_h\) along the height dimension H, and \(u_w\) along the width dimension W), and self-attention is conducted within each unit. In this way, the computational complexity is reduced to linear in the number of patches in the whole feature map. The computational complexities (\(\Omega\)) of these two self-attention mechanisms are:

$$\begin{aligned} \Omega(\text{MSA}) = 4\,dhwC^{2} + 2\,(dhw)^{2}C, \end{aligned}$$
(5)
$$\begin{aligned} \Omega(\text{LS-MSA}) = 4\,dhwC^{2} + 2\,u_{d}u_{h}u_{w}\,dhwC, \end{aligned}$$
(6)

where \(u_{d} u_{h} u_{w}\) denotes the number of patches in one unit and dhw denotes the number of patches in the whole feature map. (d, h, and w denote the depth, height, and width of the feature map, respectively.) In most cases, \(u_{d} u_{h} u_{w}\ll dhw\). The Softmax operation is omitted when computing the computational complexity.
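To make the gap concrete, the short calculation below evaluates Eqs. (5) and (6) for one hypothetical configuration (d = h = w = 32 and C = 64, i.e., the patch grid of a 64\(\times\)128\(\times\)128 input, with an assumed 3\(\times\)6\(\times\)6 unit chosen only for illustration):

```python
def flops_msa(d, h, w, C):
    # Eq. (5): global self-attention over all d*h*w patches
    n = d * h * w
    return 4 * n * C**2 + 2 * n**2 * C

def flops_ls_msa(d, h, w, C, ud, uh, uw):
    # Eq. (6): self-attention restricted to units of ud*uh*uw patches
    n, u = d * h * w, ud * uh * uw
    return 4 * n * C**2 + 2 * u * n * C

d, h, w, C = 32, 32, 32, 64                       # illustrative sizes
print(flops_msa(d, h, w, C) / 1e9)                # ~138.0 (quadratic term dominates)
print(flops_ls_msa(d, h, w, C, 3, 6, 6) / 1e9)    # ~0.99
```

These numbers count only the attention terms in Eqs. (5) and (6), not the whole model.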

3.3.3 Global scope multi-head self-attention (GS-MSA)

The LS-MSA performs self-attention only within each local unit, which lacks global information interaction and long-range dependency. To address this issue, we design a global scope multi-head self-attention mechanism to attain information interaction across different units in a dilated manner. As illustrated in Fig. 3b, for a whole feature map, we pick one patch every g patches along each dimension and form a unit with all the patches thus picked, on which self-attention is then conducted. Likewise, we pick the other patches to form new units, until all the patches are utilized. Hence, the receptive field in computing self-attention is enlarged without increasing the number of patches involved, which means that long-range information interaction is attained at no extra computational cost. To keep consistency between LSM and GSM, we set \(d=g_{d} \times u_{d}\), \(h=g_{h} \times u_{h}\), and \(w=g_{w} \times u_{w}\), which ensures that the numbers of units in LSM and GSM are the same. Here, \(d \times h \times w\) denotes the number of patches in the whole feature map, \(u_{d} \times u_{h} \times u_{w}\) denotes the number of patches in one unit, and \(g_{d}\), \(g_{h}\), and \(g_{w}\) denote the distance between two nearest picked patches along the depth dimension D, height dimension H, and width dimension W, respectively.
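As a concrete illustration of this dilated sampling, the sketch below regroups a grid of patch tokens so that each resulting unit contains every g-th patch along each axis (matching the 6\(\times\)6\(\times\)6 feature map and 3\(\times\)3\(\times\)3 unit of Fig. 3b with g = 2); the function name and tensor layout are assumptions for illustration, not the paper's code.

```python
import torch

def gather_dilated_units(x, g):
    """Regroup tokens on a (D, H, W) grid for GS-MSA: patches sharing the same
    offset modulo (g_d, g_h, g_w) form one unit, so each unit samples the whole
    feature map with stride g along every axis (illustrative reshape only)."""
    N, D, H, W, C = x.shape
    gd, gh, gw = g
    ud, uh, uw = D // gd, H // gh, W // gw            # patches per unit along each axis
    x = x.view(N, ud, gd, uh, gh, uw, gw, C)
    # Offsets (gd, gh, gw) enumerate the units; (ud, uh, uw) enumerate the
    # dilated patches inside each unit.
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7).contiguous()
    return x.view(N, gd * gh * gw, ud * uh * uw, C)   # (N, num_units, unit_size, C)

units = gather_dilated_units(torch.randn(1, 6, 6, 6, 32), g=(2, 2, 2))
print(units.shape)  # torch.Size([1, 8, 27, 32]) -- 8 units of 3*3*3 patches each
```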

3.4 Down-sampling and up-sampling

Between every two adjacent D-Former blocks of the encoder, a down-sampling layer is utilized to merge patches for further feature fusion. Specifically, a down-sampling layer concatenates the feature maps of \(2\times 2\times 2\) neighboring patches (2 neighboring patches along each of the width, height, and depth dimensions), reducing the number of patches by 8 times. Then, a fully connected layer is utilized to reduce the feature channel size by 4 times, so that the channel size is doubled after each down-sampling layer. Thus, the output feature maps of the three down-sampling layers are \(x_{2} \in {\mathbb {R}}^{(\frac{W}{8} \times \frac{H}{8} \times \frac{D}{4}) \times 2C}\), \(x_{3} \in {\mathbb {R}}^{(\frac{W}{16} \times \frac{H}{16} \times \frac{D}{8}) \times 4C}\), and \(x_{4} \in {\mathbb {R}}^{(\frac{W}{32} \times \frac{H}{32} \times \frac{D}{16}) \times 8C}\), respectively. In reverse to the down-sampling layers, the up-sampling layers of the decoder are used to enlarge the low-resolution feature maps and reduce the number of channel dimensions. In this way, our model is able to extract features in a multi-scale manner and yield better segmentation accuracy.
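A minimal sketch of such a patch-merging down-sampling step is given below, assuming the patch tokens are kept on a (D, H, W) grid: concatenating each 2\(\times\)2\(\times\)2 neighborhood gives 8C channels, and a linear layer maps them to 2C. The class name and the bias-free linear layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchMerging3D(nn.Module):
    """Sketch of the down-sampling layer: concatenate the features of each
    2x2x2 group of neighboring patches (8x fewer patches, 8C channels), then
    reduce the channels to 2C with a fully connected layer."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(8 * dim, 2 * dim, bias=False)

    def forward(self, x):                 # x: (N, D, H, W, C)
        N, D, H, W, C = x.shape
        x = x.view(N, D // 2, 2, H // 2, 2, W // 2, 2, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(N, D // 2, H // 2, W // 2, 8 * C)
        return self.reduction(x)          # (N, D/2, H/2, W/2, 2C)
```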

3.5 The dynamic position encoding block

Depth-wise convolution (DW-Conv) is a type of convolution that applies a single convolutional filter to each input channel instead of to all channels as in a common convolution, which decreases the computational cost. We apply a 3D depth-wise convolution [50] to the input feature maps (or images) once in every D-Former block to learn position information. The learned position information is then added to the original input \(x_i\) as:

$$\begin{aligned} x_i^\prime = \text {Resize}(\text {DW-Conv}(\text {Resize}(x_i)))+x_i, \end{aligned}$$
(7)

where \(x_i\) denotes the input feature maps of the ith D-Former block and \(x_i^\prime\) denotes the output feature maps embedded with position information. Resize is used to adjust the dimensions of the feature maps \(x_i\) to meet the input requirements of the DW-Conv.

In this way, position information among patches can be extracted by a depth-wise convolution. Since the position information is dynamically learned from the input x itself, the drawback of previous works that require a fixed number of patches is avoided. In addition, the convolution's inherent translation invariance can be utilized to increase stability and generalization performance [69].
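The sketch below expresses Eq. (7) in code, assuming the tokens arrive as a flattened sequence and that the depth-wise convolution uses a 3\(\times\)3\(\times\)3 kernel with same padding; the kernel size and the exact Resize operations are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DynamicPositionEncoding3D(nn.Module):
    """Sketch of Eq. (7): learn position cues with a 3D depth-wise convolution
    and add them back to the input tokens."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dw_conv = nn.Conv3d(dim, dim, kernel_size, padding=kernel_size // 2,
                                 groups=dim)        # depth-wise: one filter per channel

    def forward(self, x, d, h, w):                  # x: (N, d*h*w, C)
        N, _, C = x.shape
        v = x.transpose(1, 2).reshape(N, C, d, h, w)   # "Resize": sequence -> volume
        v = self.dw_conv(v)                            # learned position information
        v = v.flatten(2).transpose(1, 2)               # "Resize" back: volume -> sequence
        return x + v                                   # Eq. (7)
```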

4 Experiments

4.1 Datasets

The Synapse multi-organ segmentation (Synapse) dataset includes 30 axial contrast-enhanced abdominal CT scans. Following the training–test split in [37], 18 of the 30 scans are used for training and the remaining ones are for testing. The average dice similarity coefficient (DSC) [17] is used as the measure for evaluating the segmentation performances of the eight target organs, including aorta, gallbladder, kidney (L), kidney (R), liver, pancreas, spleen, and stomach.

The Automated Cardiac Diagnosis Challenge (ACDC) dataset contains 150 magnetic resonance imaging (MRI) 3D cases collected from different patients, and each case covers a heart organ from the base to the apex of the left ventricle. Following the setting in [37], only 100 well-annotated cases are used in the experiments, and the training, validation, and test data are partitioned with the ratio of 7: 1: 2. For fair comparison, the average DSC is employed to evaluate the segmentation performances following the previous work [37], and three key parts of the heart are chosen as targets, including the right ventricle (RV), myocardium (Myo), and left ventricle (LV).

4.2 Implementation setup

Pre-training. Our D-Former model is trained from scratch, i.e., we initialize the model's weights randomly. Note that in common practice, pre-training is important for Transformer-based models, because the pre-training process provides generalized representations and prior knowledge for downstream tasks. For example, the Vision Transformer (ViT) [24] considered that model performance depends heavily on pre-training, and its experiments verified this view. Besides, many known medical image segmentation methods used pre-trained weights to initialize their models [32, 37, 39, 70, 71]. However, pre-training Transformer-based models raises two issues. First, the pre-training process usually incurs a high cost in terms of time and computation. Second, for medical images, there are few complete and acknowledged sizable datasets for pre-training (in comparison, ImageNet [72] is available for natural scene images), and the domain gap between natural images and medical images makes it hard for medical image segmentation models to directly use existing large natural image datasets. For these reasons, we choose to train our D-Former model from scratch, which nevertheless yields promising performance that surpasses state-of-the-art methods with pre-training.


Implementation details. Our proposed D-Former is implemented in PyTorch 1.8.0, and all the experiments are run on an NVIDIA GeForce RTX 3090 GPU with 24 GB memory. The batch size is 2 during training and 1 during inference. The SGD optimizer [73] with momentum 0.99 is used. The initial learning rate is 0.01 with a weight decay of 3e-5. The poly learning rate strategy [74] is utilized with a maximum of 3000 training epochs for the Synapse dataset and 1500 for the ACDC dataset. The training takes about 8 h for the Synapse dataset and about 6.5 h for the ACDC dataset, and the test time per sample is about 1.3 s for the Synapse dataset and about 1.2 s for the ACDC dataset.
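A minimal sketch of this optimization setup is shown below; the poly decay exponent of 0.9 is a common convention and an assumption here, since the text above only names the strategy, and the placeholder model stands in for the actual D-Former network.

```python
import torch

max_epochs = 3000  # Synapse (1500 for ACDC)

model = torch.nn.Linear(8, 8)  # placeholder for the D-Former model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.99, weight_decay=3e-5)
# Poly learning-rate decay: lr = 0.01 * (1 - epoch / max_epochs) ** 0.9 (assumed exponent)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: (1 - epoch / max_epochs) ** 0.9)

for epoch in range(max_epochs):
    # ... one training epoch over the dataset ...
    scheduler.step()
```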


Loss function. The cross-entropy loss and Dice loss are both widely used for general segmentation tasks. However, since the cross-entropy loss tends to perform well for uniform class distributions while the Dice loss is more suitable for target objects of large sizes [75], each of them alone may not be effective for medical image segmentation tasks that involve imbalanced classes and small target objects. Thus, our loss function combines the binary cross-entropy loss [76] and the Dice loss [17], and is defined as:

$$\begin{aligned} {\mathcal {L}}(Y, {\hat{Y}})=-\frac{1}{N} \sum _{n=1}^{N}\left( \frac{1}{2} \cdot Y_{n} \cdot \log {\hat{Y}}_{n}+\frac{2 \cdot Y_{n} \cdot {\hat{Y}}_{n}}{Y_{n}+{\hat{Y}}_{n}}\right) \end{aligned}$$
(8)

where \(Y_{n}\) and \({\hat{Y}}_{n}\) denote the ground truth and predicted probabilities of the \(n^{th}\) image, respectively, and N is the batch size.
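In the same spirit as Eq. (8), the sketch below combines a cross-entropy term and a soft Dice term for multi-class volumetric predictions; the equal 0.5/0.5 weighting, the smoothing constant, and the multi-class (softmax) formulation are assumptions and may differ from the exact implementation.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-5):
    """Combined cross-entropy + soft Dice objective (illustrative).
    logits: raw per-class scores of shape (N, K, D, H, W);
    target: integer labels of shape (N, D, H, W)."""
    ce = F.cross_entropy(logits, target)

    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])   # (N, D, H, W, K)
    one_hot = one_hot.permute(0, 4, 1, 2, 3).float()           # (N, K, D, H, W)
    dims = (0, 2, 3, 4)                                        # sum over batch and voxels
    intersection = (probs * one_hot).sum(dims)
    dice = (2 * intersection + eps) / (probs.sum(dims) + one_hot.sum(dims) + eps)
    dice_loss = 1 - dice.mean()

    return 0.5 * ce + 0.5 * dice_loss
```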

Table 1 Segmentation performances of different methods on the Synapse dataset (average dice similarity coefficient (DSC))
Table 2 Segmentation performances of different methods on the ACDC dataset (average dice similarity coefficient (DSC))

4.3 Quantitative results

We evaluate the performance of our proposed D-Former model on the Synapse and ACDC datasets, and compare with various state-of-the-art models, including V-Net [17], DARR [77], R50 U-Net [10], R50 Att-UNet [78], U-Net [10], Att-UNet [78], VIT [24], R50 VIT [24], TransUNet [37], Swin-UNet [41], LeVit-Unet-384 [71], nnFormer [32], and MISSFormer [43].

Quantitative results on the Synapse dataset are reported in Table 1, which show that our method outperforms the previous work by a clear margin. It is notable that the concurrent Transformer-based methods nnFormer and MISSFormer achieve some performance gains compared to the CNN-based methods, while our method brings further improvements in the average DSC of 0.0143 over nnFormer and 0.0687 over MISSFormer. Besides, our D-Former obtains accuracy improvements on almost every organ class, except the pancreas and stomach, which verifies that D-Former is a promising and robust framework.

Quantitative results on the ACDC dataset are reported in Table 2, and a similar conclusion can be drawn. D-Former achieves the best average DSC of 0.9229 without pre-training. Compared with the other methods, our method brings improvements in the average DSC by 0.0474 over R50 U-Net and by 0.0554 over R50 Att-UNet. Compared to the concurrent Transformer-based methods, our method still achieves 0.0051 performance gain over nnFormer and 0.0439 over MISSFormer in the average DSC. Specifically, among all the key parts of the heart, including the right ventricle (RV), myocardium (Myo), and left ventricle (LV), our D-Former obtains the best segmentation accuracy compared to the other methods in the average DSC.

The results in Tables 1 and 2 show that our D-Former attains excellent generalization on both CT data and MRI data, outperforming the previous methods. Notably, different from most of the known Transformer-based frameworks that require a pre-training process, D-Former is initialized randomly and is trained from scratch, yet still obtains competitive performances. This implies that our model could be more suitable for medical imaging tasks when general large size medical image pre-training datasets (such as ImageNet [72] for natural scene images) are lacking.

To verify the statistical differences between our proposed method and the compared methods, the Sign test [79] and paired t-test [80] are conducted. (1) For the Synapse dataset, the Sign test is conducted between our method and each compared method, where the inputs are the per-organ segmentation accuracies (i.e., average DSC) of the two paired groups (one is our method and the other is a compared method). When comparing our method with nnFormer, the p-value is 0.09, and the p-values are 0.02 between our method and the other methods. (2) For the ACDC dataset, the paired t-test is utilized, where the inputs are the per-structure segmentation accuracies of the two paired groups. The detailed p-values are shown in Table 2. One can see that our proposed method slightly outperforms nnFormer and TransUNet, while significantly outperforming the other methods. Meanwhile, our method still achieves a lower computational cost than nnFormer and the other methods (see Sect. 4.4).

Table 3 Comparison of the numbers of parameters and FLOPs among various methods that segment 3D medical images directly
Table 4 Comparison of the numbers of parameters and FLOPs with/without key designs in our method

4.4 Comparison of model complexity

In Table 3, we compare the numbers of parameters and floating point operations (FLOPs) of our proposed D-Former with those of different 3D medical image segmentation models, including UNETR [38], CoTr [40], TransBTS [70], and nnFormer [32]. The number of FLOPs is calculated based on an input image size of 64\(\times\)128\(\times\)128 for fair comparison. Note that we omit the complexity brought by activation functions and normalization layers. Table 3 shows that our D-Former has 44.26M parameters and 54.46G FLOPs, a lower computational cost than nnFormer (157.88G FLOPs), TransBTS (171.30G FLOPs), CoTr (377.48G FLOPs), UNETR (86.02G FLOPs), and 3D U-Net (947.69G FLOPs). The CNN-based 3D U-Net has fewer parameters, but it is burdened with a high model complexity of 947.69G FLOPs, which is much larger than that of our D-Former. Moreover, compared with the other Transformer-based models, our model still shows comparable model complexity while outperforming these models by a large margin.

To further explore the effectiveness of our model in reducing model complexity, we remove two key designs, respectively, and compute the corresponding numbers of parameters and FLOPs. As shown in Table 4, the patch embedding layer and the local scope multi-head self-attention (LS-MSA) and global scope multi-head self-attention (GS-MSA) contribute considerably to decreasing the model complexity. Specifically, without the patch embedding layer, an input image is directly projected into C channel dimensions and fed to the subsequent Transformer architecture, leading to 1189.47G FLOPs. Besides, the introduction of LS-MSA and GS-MSA helps decrease the FLOPs from 477.19G to 54.46G, which is consistent with the theoretical analysis in Sect. 3.3.2.

Fig. 4 Visual comparison with several state-of-the-art methods on some hard samples of the Synapse dataset. The red marks indicate regions where our model attains discriminative segmentation performance

4.5 Qualitative visualizations

To intuitively demonstrate the performance of our D-Former model, we compare some qualitative results of our model with several other methods (including Swin-Unet, TransUnet, and UNet) on the Synapse dataset; some hard samples are shown in Fig. 4. One can see that the predicted organ masks of our model are much more similar to the ground truth in general. As for specific organs, our model has better accuracy in identifying and sketching the contours of the stomach (e.g., the first and fourth rows), which is consistent with the conclusions drawn from the quantitative results above. In the second row, only our model can delineate the outline of the pancreas well, suggesting that our model has a better ability to capture long-range dependency, given that the shape of the pancreas is long and narrow. In addition, as illustrated in the third row, our D-Former is able to identify the true region of the liver, while the other three models make some mistakes on the liver. This shows that our method is effective at exploiting the relations between the target organs' patches and the other patches, owing to our model's dynamic position encoding block. In a nutshell, the qualitative visualizations provide intuitive demonstrations of our model's high segmentation accuracy, especially on some slices that are difficult to segment.

Table 5 Ablation study on the effect of the global scope module (GSM) (average dice similarity coefficient (DSC))
Table 6 Ablation study on the effect of the global scope multi-head self-attention (GS-MSA) (average dice similarity coefficient (DSC))
Table 7 Ablation study on the effect of dynamic position encoding (DPE) (average dice similarity coefficient (DSC))
Table 8 Ablation study on the positions of the dynamic position encoding block (average dice similarity coefficient (DSC))
Table 9 Ablation study on the sizes of different architecture variants (average dice similarity coefficient (DSC))

4.6 Ablation studies

We conduct ablation studies on the Synapse dataset to evaluate the effectiveness of our model design.


Effect of the global scope module (GSM). To investigate the necessity of the global scope module (GSM), we replace it with the local scope module (LSM), keeping the other architectural components unchanged. As shown in Table 5, the GSM is beneficial to the segmentation accuracy, outperforming the LSM-only variant by 0.0066 in the average DSC. This verifies the necessity of exploring global interactions of patches across units.


Global scope multi-head self-attention (GS-MSA) vs. other self-attention. To confirm the effectiveness of our GS-MSA, we compare it with the shifted window strategy (SW-MSA) proposed in Swin Transformer [29], which achieves state-of-the-art performance in multiple computer vision tasks. Similar to our GS-MSA design, the shifted window strategy aims to introduce global attention. Table 6 shows that our global attention design surpasses that of Swin Transformer by 0.0133 in the average DSC.


Dynamic position encoding vs. other position encodings. We compare our dynamic position encoding (DPE) with other common position encoding methods, including relative position encoding (RPE) [29, 48], absolute position encoding (APE) [22], and sinusoidal position encoding (SPE) [22]. The results are shown in Table 7. Compared to APE, SPE, and RPE, our DPE improves the average DSC by 0.0405, 0.0279, and 0.0248, respectively.

Fig. 5 Different positions to apply the DPE block. D-Former block 3 is used as an example for illustration, which contains three LSMs and three GSMs arranged in an alternate manner


Position of the dynamic position encoding block. We conduct experiments to examine the performance of different choices of where to apply the dynamic position encoding block in every D-Former block, including placing it (a) before the first LSM, (b) right after the first LSM, and (c) right after the first GSM, as illustrated in Fig. 5 using D-Former block 3 as an example. Table 8 shows that introducing the position information before the first LSM provides the best segmentation outcomes.


The sizes of different architecture variants. To evaluate the effect of model size, three variants of our D-Former are evaluated. Specifically, the architecture hyper-parameters of the model variants are:

  • D-Former-Small: C = 64, L = {2, 2, 2, 2},

  • D-Former-Base: C = 64, L = {2, 2, 6, 2},

  • D-Former-Large: C = 96, L = {2, 2, 6, 2},

where C is the channel number of the hidden layers and L gives the numbers of LSMs and GSMs in the four encoder D-Former blocks. As shown in Table 9, D-Former-Large achieves the best average DSC of 0.8883, improving by 0.0480 and 0.0142 over D-Former-Small and D-Former-Base, respectively.

5 Conclusions

In this paper, we proposed a novel 3D medical image segmentation framework called D-Former, which utilizes the common U-shaped encoder–decoder design and is constructed based on our new Dilated Transformer. Our proposed D-Former model achieves both good efficiency and accuracy, due to its reduced number of patches used in self-attention in the local scope module (LSM) and its exploration of long-range dependency with a dilated scope of attention in the global scope module (GSM). Moreover, we introduced a dynamic position encoding block, making it possible to flexibly learn vital position information within input sequences. In this way, our model not only reduces the model parameters and FLOPs, but also attains state-of-the-art semantic segmentation performance on the Synapse and ACDC datasets.