Introduction

A statistical analysis by the American Cancer Society revealed that in 2023, prostate cancer (PCa) was the most prevalent male cancer in the United States, accounting for 29% of new cancer cases and 11% of cancer deaths in men1. Early detection of prostate cancer depends on precise localization of the prostate region, which can be facilitated by the MRI-based Prostate Imaging Reporting and Data System (PI-RADS)2. Radiologists frequently need to segment prostate regions in MRI scans manually, but accurate identification and segmentation of anatomical structures is a laborious and highly specialized process3. Persistent inter-observer variability further challenges the precise delineation of regions of interest (ROIs) during manual segmentation4. There is thus an urgent demand for automated and reliable prostate MR image segmentation.

Early segmentation algorithms relying on thresholding, edge-detection filters, and classical machine learning techniques were cumbersome, time-consuming, and often unreliable5. With the rapid growth of deep learning, convolutional neural networks (CNNs), with their efficient hierarchical feature representation, have become widely employed in medical image segmentation. To address the imbalance between positive and negative samples in segmentation, Milletari et al.6 proposed V-Net for prostate MRI segmentation, integrating 3D convolution with the U-Net architecture and introducing a Dice coefficient loss function. Rundo et al.7 proposed USE-Net, a convolutional neural network that incorporates Squeeze-and-Excitation blocks8 into U-Net, for the prostate region segmentation task. The intrinsic locality of convolution, however, restricts CNNs' capacity to learn long-range relationships. The Transformer9 overcomes this limitation in global representation: it accurately models long-range interdependence through multi-head self-attention (MSA)10. Yan et al.11 proposed Convolutional coupled Transformer U-Net (CCT-Unet), a U-shaped architecture based on a convolutional coupled Transformer, for segmenting the peripheral and transition zones in prostate MRI. MSA, however, entails quadratic complexity relative to image size12, leading to substantial computational overhead in downstream dense prediction tasks such as medical image segmentation13. An efficient way to endow CNNs with long-range dependency modeling therefore remains an open problem.

Recently, Mamba14 has further improved the Structured State Space Sequence model (S4)15 through a selection mechanism, and it is considered promising for addressing CNNs' lack of long-range dependency modeling. On continuous long-sequence data, such as natural language and genomic sequences, Mamba delivers state-of-the-art performance with linear complexity, even outperforming the Transformer. For instance, Zhu et al.16 established Vim, a universal vision backbone built on bidirectional Mamba blocks, which marks image sequences with positional embeddings and compresses visual representations with bidirectional state space models. To handle objects with heterogeneous appearances, Ma et al.13 proposed U-Mamba, a new architecture for general medical image segmentation that is regarded as a strong contender for the backbone of future medical image segmentation networks.

Motivated by the aforementioned research, we employ Mamba blocks to strengthen CNNs' long-range modeling capability and propose MM-UNet, a U-shaped network for prostate MR image segmentation that combines CNN with Mamba. It is composed of three well-designed modules, skip connections, and the traditional encoder-decoder structure. Specifically, we represent multi-scale features at the granular level using a 3D Res2Net17 encoder and replace the traditional skip connections with an adaptive feature fusion module (AFFM) and a multi-scale anisotropic convolution module (MACM). In addition, we propose a global context-aware module (GCM) that is integrated into the network's bottleneck layer. We evaluate MM-UNet on two publicly available prostate MR imaging datasets and achieve state-of-the-art performance compared with competing approaches. Our contributions can be summarized as follows:

  • An AFFM based on the channel attention mechanism is proposed to adaptively recognize and select the most discriminative spatial and semantic information, thereby guiding the effective fusion of adjacent hierarchical features;

  • A GCM based on Mamba blocks, which scale linearly with feature size, is proposed to capture global context information in the image at low computational cost, thus complementing local fine-grained features;

  • A MACM based on multi-scale anisotropic convolution and 3D convolution decomposition is proposed. It can achieve robust and accurate prostate segmentation by exploiting 3D contextual information and multi-scale features of MR images with anisotropic resolution while reducing complexity.

Methods

Overview

Figure 1 illustrates the proposed MM-UNet, which uses an encoder-decoder network design to effectively extract local features and global context from prostate MR images. MM-UNet consists of five main components: (1) a 3D Res2Net encoder for encoding features at different scales; (2) an adaptive feature fusion module (AFFM) for adaptively combining low-level encoder features with high-level decoder features; (3) a global context-aware module (GCM) that improves long-range modeling of MR images by exploiting Mamba's linear scaling property; (4) a multi-scale anisotropic convolution module (MACM) for capturing multi-scale contextual information and overcoming interference from anisotropic spatial resolution; and (5) a 3D decoder based on anisotropic convolutional layers for predicting segmentation results. Each component is described in detail below.

Fig. 1

The overall architecture of the proposed MM-UNet. MM-UNet follows the classic encoder-decoder structure with skip connections and combines three carefully designed novel modules to achieve robust prostate MR image segmentation: the adaptive feature fusion module (AFFM), the global context-aware module (GCM), and the multi-scale anisotropic convolution module (MACM). The encoder uses modified 3D Res2Net blocks to extract multi-scale information, and the decoder combines MACM (kernel size 3) with 1 × 1 × 1 convolutions to decode layer by layer.

3D Res2Net encoder

As the encoder backbone, we employ Res2Net17, pre-trained on the ImageNet dataset18. Compared with ResNet19, Res2Net's hierarchical residual-like connections within each residual block allow it to use multi-scale information at the granular level more effectively, expanding each layer's receptive field. A given image is first passed through a convolution stem, which consists of parallel 3 × 3 × 3 and 1 × 1 × 1 convolutions followed by a summation, to generate rich local information20. The features then pass through four Res2Net residual blocks in sequence, producing features at different spatial resolutions. Because Res2Net was originally developed for 2D natural image analysis, it must be extended for 3D medical image segmentation. To do this, we replace the 7 × 7 2D convolution in the first layer with a 7 × 7 × 3 3D convolution, and in all other layers we directly transform each x × x 2D convolution into an x × x × 1 3D convolution21. Specifically, a 7 × 7 × 3 kernel applied to single-channel 3D input is equivalent to the 7 × 7 kernel applied to three-channel 2D input in Res2Net, so our encoder can be initialized with pre-trained Res2Net weights. This initialization transfers image representation capabilities acquired on large-scale natural image datasets to prostate MR images.
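To make the kernel inflation concrete, the following PyTorch sketch shows how pre-trained 2D Res2Net weights could be reused in a 3D encoder. The function names, stride, and padding choices are our own illustrative assumptions, not the released implementation.

```python
import torch.nn as nn

def inflate_stem_conv(conv2d: nn.Conv2d) -> nn.Conv3d:
    """First layer: reuse the pre-trained 7 x 7 conv (3-channel 2D input)
    as a 7 x 7 x 3 conv for single-channel 3D volumes. The RGB axis of the
    2D kernel becomes the depth axis of the 3D kernel. Stride and padding
    here are illustrative assumptions."""
    conv3d = nn.Conv3d(1, conv2d.out_channels, kernel_size=(7, 7, 3),
                       stride=(2, 2, 1), padding=(3, 3, 1), bias=False)
    # (O, 3, 7, 7) -> (O, 1, 7, 7, 3)
    conv3d.weight.data = conv2d.weight.data.permute(0, 2, 3, 1).unsqueeze(1)
    return conv3d

def inflate_conv(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Other layers: turn an x-by-x 2D conv into an x-by-x-by-1 3D conv,
    copying the pre-trained weights unchanged."""
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(kh, kw, 1),
                       stride=(*conv2d.stride, 1),
                       padding=(*conv2d.padding, 0),
                       bias=conv2d.bias is not None)
    conv3d.weight.data = conv2d.weight.data.unsqueeze(-1)  # add depth axis
    if conv2d.bias is not None:
        conv3d.bias.data = conv2d.bias.data.clone()
    return conv3d
```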

Adaptive feature fusion module

Due to the complex anatomical structure and morphology of the prostate, both high-level features from the decoder and low-level features from the encoder are crucial for accurate prostate prediction22. To this end, we propose an adaptive feature fusion module (AFFM) guided by an attention mechanism, which adaptively combines the most effective features from the two sources to obtain the best representation of the prostate structure. The specific structure is shown in Fig. 2.

Fig. 2

The specific structure of AFFM. It utilizes channel attention to guide adjacent features for adaptive fusion.

First, we upsample the high-level features \(F_{h}^{i + 1}\) to the same size as the low-level features \(F_{l}^{i} (i \in \{ 1,2,3,4\} )\) and fuse them through a channel concatenation operation, as shown in Eq. (1). Second, we introduce the Squeeze-and-Excitation block8 as the channel attention to compute a weight vector, thereby re-weighting the low-level features and suppressing interference from irrelevant background noise. Specifically, we first squeeze each channel of \(F_{c}^{i}\) with a global average pooling operation to obtain its global information. The pooled result then passes through two fully connected layers and a Sigmoid function to obtain the channel attention weight \(\alpha\), as shown in Eq. (2). Finally, we sum the low-level features, weighted by \(\alpha\), with the high-level features to obtain the final result \(F^{i}\), as shown in Eq. (3). By employing AFFM to gradually guide the fusion between high-level and low-level features, the proposed MM-UNet can suppress irrelevant background noise and retain more semantic information for more precise localization.

$$ F_{c}^{i} = {\mathbb{C}}\left( {F_{l}^{i} ,f_{u} \left( {F_{h}^{i + 1} } \right)} \right) $$
(1)
$$ \alpha = f_{\sigma } \left\{ {f_{FC}^{2} \left( {\max \left( {0,f_{FC}^{1} \left( {f_{gap} \left( {F_{c}^{i} } \right)} \right)} \right)} \right)} \right\} $$
(2)
$$ F^{i} = \alpha \otimes F_{l}^{i} + F_{h}^{i + 1} $$
(3)

Among them, \({\mathbb{C}}\) represents the concatenation operation, \(f_{u}\) represents the upsampling operation, \(f_{\sigma }\) represents the Sigmoid function, \(f_{FC}^{i} (i \in \{ 1,2\} )\) represents the fully connected layer, \(f_{gap}\) represents the global average pool, and \(\otimes\) represents element-wise multiplication.
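A minimal PyTorch sketch of Eqs. (1)–(3) is given below. The channel reduction ratio r and the assumption that the upsampled high-level features already match the low-level channel width are ours, not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFFM(nn.Module):
    """Sketch of the adaptive feature fusion module, Eqs. (1)-(3)."""
    def __init__(self, channels: int, r: int = 16):  # r is an assumption
        super().__init__()
        self.fc1 = nn.Linear(2 * channels, channels // r)  # f_FC^1
        self.fc2 = nn.Linear(channels // r, channels)      # f_FC^2

    def forward(self, f_low, f_high):
        # Eq. (1): upsample F_h^{i+1} to the size of F_l^i, then concatenate.
        # We assume f_high already has `channels` channels so Eq. (3) is valid.
        f_high = F.interpolate(f_high, size=f_low.shape[2:], mode='trilinear',
                               align_corners=False)
        f_cat = torch.cat([f_low, f_high], dim=1)
        # Eq. (2): global average pooling -> two FC layers -> Sigmoid.
        s = f_cat.mean(dim=(2, 3, 4))                      # f_gap
        alpha = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))
        # Eq. (3): re-weight the low-level features, add the high-level ones.
        return alpha.view(*alpha.shape, 1, 1, 1) * f_low + f_high
```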

Global context-aware module

Inspired by previous studies13, we propose a global context-aware module (GCM) to achieve global context understanding in medical image segmentation. It leverages Mamba's linear scaling property to enhance the CNN's long-range dependency modeling.

As shown in Fig. 3, GCM consists of two consecutive residual blocks19 and a Mamba block, where each residual block consists of a plain convolutional layer, instance normalization (IN), and Leaky ReLU activation13. Subsequently, the image features of size B × C × H × W × D are flattened and transposed to B × L × C, where L = H × W × D, B is the batch size, C is the number of channels, and H, W, and D are the height, width, and depth of the feature map.

Fig. 3

The specific structure of GCM. IN is the abbreviation for instance normalization.

After layer normalization, the features enter the Mamba block, which contains two parallel branches for enhanced long-range dependency modeling. In the first branch, the features are expanded to B × L × 2C through a linear layer, followed by a 1D convolutional layer, a HardSwish activation function23, and an SSM layer. In the second branch, the features are likewise expanded to B × L × 2C through a linear layer, followed by a HardSwish activation function. The features of the two branches are then merged via the Hadamard product, projected back to the original shape, and reshaped and transposed to B × C × H × W × D. GCM effectively improves global context capture in prostate MR image segmentation while maintaining high efficiency during training and inference.
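This structure can be sketched with the off-the-shelf Mamba block from the mamba-ssm package, as below. Note that the stock block uses SiLU internally whereas the module described above swaps in HardSwish, and the d_state/d_conv/expand settings are our assumptions.

```python
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm

class ResidualBlock(nn.Module):
    """Conv3d -> InstanceNorm -> LeakyReLU, wrapped in a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.InstanceNorm3d(channels),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)

class GCM(nn.Module):
    """Sketch of the GCM: two residual blocks, then layer normalization and
    a Mamba block over the flattened B x L x C sequence (L = H * W * D)."""
    def __init__(self, channels: int):
        super().__init__()
        self.res = nn.Sequential(ResidualBlock(channels), ResidualBlock(channels))
        self.norm = nn.LayerNorm(channels)
        self.mamba = Mamba(d_model=channels, d_state=16, d_conv=4, expand=2)

    def forward(self, x):
        x = self.res(x)
        b, c, h, w, d = x.shape
        seq = x.flatten(2).transpose(1, 2)   # B x C x (HWD) -> B x L x C
        seq = self.mamba(self.norm(seq))     # cost scales linearly in L
        return seq.transpose(1, 2).reshape(b, c, h, w, d)
```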

Multi-scale anisotropic convolution module

We propose a multi-scale anisotropic convolution module (MACM) with large convolution kernels to improve the classification and localization capabilities of the network24, and combine it with AFFM to form the skip connection at each scale. The specific structure of MACM is shown in Fig. 4a.

Fig. 4

The specific structure of MACM. (a) MACM consists of four parallel ACBs with different convolution kernel sizes. (b) The 3D convolution decomposition principle; k denotes the kernel size.

The MACM consists of four parallel anisotropic convolutional blocks (ACBs) with kernel sizes k (\(k \in \{ 3,7,13,15\}\)). It not only enhances the localization performance and voxel classification ability of the network but also fuses features from different regions, reducing the loss of contextual information across image scales25. Specifically, 1 × 1 × 1 convolutions are first applied at the input and output to adjust the number of feature channels, thereby avoiding an uncontrolled increase in computational complexity. The input is then passed to the ACBs of the different branches, and the results of all branches are concatenated with the input to form the output.

In each ACB branch, the input is passed through 1 × 1 × k and k × k × 1 anisotropic convolutions, and the two results are summed as the branch output. With this design, the k × k × 1 convolution exploits the 2D features in the x–y plane, while the 1 × 1 × k convolution focuses on inter-slice features. At the same time, we apply the 2D convolution decomposition principle26 to 3D convolution to avoid the increase in computational complexity that large 3D kernels would otherwise incur. As shown in Fig. 4b, a large-kernel 3D convolution x × y × z is decomposed into the following combinations: x × 1 × 1 + 1 × 1 × z + 1 × y × 1 and 1 × y × 1 + 1 × 1 × z + x × 1 × 1. For a cubic kernel of size k, this decomposition reduces the number of parameters from \(k^{3}\) to \(3k\).
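The sketch below illustrates the MACM structure in PyTorch. The internal channel width is an assumption, and for brevity the k × k × 1 branches are left undecomposed rather than split into 1D convolutions as in Fig. 4b.

```python
import torch
import torch.nn as nn

class ACB(nn.Module):
    """Anisotropic convolution block: a 1x1xk branch (inter-slice features)
    plus a kxkx1 branch (in-plane features), summed."""
    def __init__(self, channels: int, k: int):
        super().__init__()
        p = k // 2
        self.inter_slice = nn.Conv3d(channels, channels, (1, 1, k), padding=(0, 0, p))
        self.in_plane = nn.Conv3d(channels, channels, (k, k, 1), padding=(p, p, 0))

    def forward(self, x):
        return self.inter_slice(x) + self.in_plane(x)

class MACM(nn.Module):
    """Sketch of the MACM: 1x1x1 convs at input and output control channel
    width; four parallel ACBs (k in {3, 7, 13, 15}) are concatenated with
    the input. The internal width `mid` is an assumption."""
    def __init__(self, channels: int, mid: int = 64, kernels=(3, 7, 13, 15)):
        super().__init__()
        self.reduce = nn.Conv3d(channels, mid, 1)
        self.branches = nn.ModuleList(ACB(mid, k) for k in kernels)
        self.project = nn.Conv3d(mid * (len(kernels) + 1), channels, 1)

    def forward(self, x):
        t = self.reduce(x)
        out = torch.cat([t] + [b(t) for b in self.branches], dim=1)
        return self.project(out)
```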

Results

Dataset

We evaluate the efficacy and performance of MM-UNet on two publicly accessible prostate MR image segmentation datasets. The PROMISE12 dataset27 was used in the prostate segmentation challenge held at MICCAI 2012. It comprises scans of individuals with benign conditions and with prostate cancer, collected from four different medical institutes. Of the 80 available T2-weighted axial prostate MR images, 50 have expert segmentation masks. The ASPS13 dataset28, used in the NCI-ISBI 2013 Automated Segmentation of Prostate Structures challenge, consists of 60 T2-weighted MR images, all from patients with prostate cancer. Physicians marked the true prostate boundaries in each training case. The test set contains 10 prostate MR images whose ground truth is not provided.

Implementation details and evaluation metrics

The proposed model is implemented in Python with the PyTorch framework, and experiments are conducted on four NVIDIA RTX 4090 GPUs. After normalizing the prostate MR images, we apply a variety of online data augmentation techniques, including random Gaussian noise, random flipping, and random rotation, to expand the training data21. During training, a random crop size of 128 × 128 × 128 is used, and the loss function is the cross-entropy loss. For all experiments, Adam29 is employed as the optimizer with a poly learning-rate schedule, a weight decay of 5 × 10−4, and an initial learning rate of 1 × 10−4. We employ five-fold cross-validation for training and testing to obtain fair and reliable performance estimates for the various methods30, and we apply the same data preprocessing and learning strategy in all experiments for a fair comparison31.
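For reference, a common form of the poly schedule is sketched below; the decay power of 0.9 is a customary default and an assumption on our part.

```python
import torch

def poly_lr(base_lr: float, cur_iter: int, max_iter: int,
            power: float = 0.9) -> float:
    """Poly learning-rate decay: lr falls smoothly from base_lr to 0."""
    return base_lr * (1 - cur_iter / max_iter) ** power

# Usage with the optimizer settings from the text:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
# at each iteration:
#     for g in optimizer.param_groups:
#         g["lr"] = poly_lr(1e-4, cur_iter, max_iter)
```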

To evaluate prostate segmentation results scientifically and measure model effectiveness, we use three performance metrics32: Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (95HD), and Average Symmetric Distance (ASD). Higher DSC values and lower 95HD and ASD values indicate better segmentation accuracy. They are computed as follows:

$$ DSC(X,Y) = \frac{2|X \cap Y|}{{|X| + |Y|}} $$
(4)
$$ 95HD = \max_{k95\% } \left( {d_{HD} (X,Y),d_{HD} (Y,X)} \right) $$
(5)
$$ d(x,A) = \min_{y \in A} d(x,y) $$
(6)
$$ ASD = \frac{{\sum\limits_{{x \in B_{X} }} d \left( {x,B_{Y} } \right) + \sum\limits_{{y \in B_{Y} }} d \left( {y,B_{X} } \right)}}{{\left| {B_{X} } \right| + \left| {B_{Y} } \right|}} $$
(7)

where X represents the prediction result, Y represents the ground truth, \(| \cdot |\) denotes the cardinality operation, \(d_{HD} (X,Y)\) denotes the Hausdorff distance between X and Y, \(\max_{k95\% }\) denotes the maximum at the 95th percentile, \(B_{X}\) and \(B_{Y}\) denote the boundaries of X and Y, and \(d(x,A)\) is the minimum Euclidean distance from voxel x to the set A, computed under the image's actual spatial resolution.
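A NumPy/SciPy sketch of Eqs. (4)–(7) follows. Libraries such as MedPy provide reference implementations; the boundary extraction here (one-voxel erosion) is a simple assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dsc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (4): Dice similarity coefficient for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def surface_distances(pred, gt, spacing):
    """Distances from each boundary voxel of `pred` to the boundary of `gt`
    (Eq. 6), taking the voxel spacing into account."""
    b_pred = pred ^ binary_erosion(pred)
    b_gt = gt ^ binary_erosion(gt)
    dist_to_gt = distance_transform_edt(~b_gt, sampling=spacing)
    return dist_to_gt[b_pred]

def hd95_and_asd(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Eqs. (5) and (7): 95th-percentile Hausdorff distance and average
    symmetric surface distance for binary masks."""
    d_pg = surface_distances(pred.astype(bool), gt.astype(bool), spacing)
    d_gp = surface_distances(gt.astype(bool), pred.astype(bool), spacing)
    hd95 = max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))
    asd = (d_pg.sum() + d_gp.sum()) / (len(d_pg) + len(d_gp))
    return hd95, asd
```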

Comparisons with the state-of-the-art methods

We evaluate our method against nine state-of-the-art methods, including classic general segmentation networks (Attention U-Net33, V-Net6, and 3D U-Net34), Transformer-based medical segmentation models (TransUNet35, SwinUNETR36, and UNesT37), and methods specifically designed for prostate segmentation (MSD-Net21, CCT-Unet11, and CAT-Net38). For all evaluation metrics, we report performance as mean ± standard deviation.

Table 1 quantitatively compares the segmentation performance of MM-UNet and the other methods on the PROMISE12 and ASPS13 datasets. The experimental results show that our method outperforms the other techniques on both public prostate MR imaging datasets. In particular, on the PROMISE12 dataset our method achieved an average DSC of 92.39%, 95HD of 3.43 mm, and ASD of 1.42 mm, the best values among all methods. Compared with the traditional V-Net, MM-UNet improves DSC, 95HD, and ASD by 3.35%, 0.78 mm, and 0.61 mm, respectively, demonstrating its efficacy. UNesT and MSD-Net are the best-performing Transformer-based model and prostate-specific method, respectively, yet our method surpasses them by 1.42% and 0.51% in DSC. In terms of DSC, 95HD, and ASD, the proposed method generally demonstrates improved reliability for prostate MR image segmentation. Similarly, on the ASPS13 dataset our method achieves the best performance, with a DSC of 92.17%, 95HD of 3.61 mm, and ASD of 1.67 mm. Compared with the second-ranked CCT-Unet, our method improves DSC and 95HD by 0.94% and 0.28 mm, respectively.

Table 1 Quantitative comparison of our method and others on the PROMISE12 and ASPS13 datasets.

Figures 5 and 6 display qualitative results of our method and the competitors on representative cases. Across slices from different cases, our method produces fewer incorrect segmentations, and its segmentation boundaries are closest to the ground truth. The careful structural design of the proposed MM-UNet yields considerable and consistent performance benefits on prostate MR images, as demonstrated by both the quantitative and qualitative analyses. These outcomes show the effectiveness of MM-UNet in handling the varied and complex semantics of prostate regions in this challenging task.

Fig. 5

Qualitative comparison of our method and others on the PROMISE12 dataset. The red line represents the ground truth, and the blue line represents the segmentation results of the various models.

Fig. 6

Qualitative comparison of our method and others on the ASPS13 dataset. The red line represents the ground truth, and the blue line represents the segmentation results of the various models.

Ablation study

We conduct an ablation study on the PROMISE12 dataset to evaluate the performance of our network with various modules loaded, thereby demonstrating the efficacy of each component. As the baseline, we employ the 3D U-Net34 architecture. To lower the computational complexity, we use a 4-layer encoder structure instead of the 5-layer structure of the typical U-Net design. To ensure a fair comparison, all models in the ablation experiments were run in the same computing environment with the same data augmentation.

We conduct step-by-step ablation experiments by substituting the proposed 3D Res2Net encoder, global context-aware module (GCM), adaptive feature fusion module (AFFM), and multi-scale anisotropic convolution module (MACM) for the corresponding modules in the baseline. As the quantitative results in Table 2 show, the 95HD of Model 1 is 0.11 mm lower than the baseline's, demonstrating the advantage of the 3D Res2Net encoder. The DSC of Models 2, 3, and 4 improves by 0.81%, 0.98%, and 0.57%, respectively, over Model 1, demonstrating the efficacy of GCM, AFFM, and MACM. The DSC of MM-UNet is 0.66%, 1.19%, and 0.94% higher than that of Models 5, 6, and 7, respectively, indicating that using GCM, AFFM, and MACM together further enhances performance. Overall, MM-UNet achieves significant improvements of 2.49% in DSC and 1.13 mm in 95HD over the baseline, demonstrating the superior segmentation performance of the proposed network. Figure 7 qualitatively demonstrates the advantages of the proposed modules for prostate MR image segmentation. Comparing the segmentation results of Models 1 and 2 shows that introducing GCM strengthens the network's ability to identify the prostate area, confirming that GCM fully captures global context information to refine segmentation edges. Model 5 shows less over-segmentation and under-segmentation than Model 2, indicating that AFFM adaptively fuses low-level and high-level information to improve segmentation. MM-UNet integrates GCM, AFFM, and MACM simultaneously and produces more complete and smoother segmentation results. These results show that the synergy between the modules improves segmentation of the prostate edge.

Table 2 Prostate segmentation performances of different models in our system.
Fig. 7

Visual comparison between different models in our system for prostate segmentation. The colors white, green, and red represent the correct segmentation, the under-segmentation, and the over-segmentation, respectively.

Further, we conducted ablation experiments on GCM and MACM to find the best-performing configurations. The ablation results for GCM are shown in Table 3: both the residual blocks and the HardSwish activation contribute to the final performance. We also examined the selection of multi-scale convolution kernels and the necessity of anisotropic convolution in MACM, with results presented in Table 4. Combining small and large kernel convolutions captures multi-scale feature information well, so that scale changes between image instances can be handled. Moreover, compared with ordinary 3D convolution, anisotropic convolution improves 95HD by 0.12 mm, indicating that it exploits the 3D contextual information of anisotropic-resolution MR images more effectively.

Table 3 Ablation study on global context-aware module.
Table 4 Ablation study on multi-scale anisotropic convolution module.

Discussion

Prostate MR image segmentation has been facilitated by the advent of deep learning and other convolutional-neural-network-based techniques in recent years6. Fixed receptive fields make CNNs computationally efficient, but also less capable of capturing long-range relationships39. In contrast, the Transformer uses self-attention to provide each image patch with an input-dependent global context. However, 3D medical images often have high resolution, which imposes a significant computational burden on Transformer-based methods40. Recently, the structured state space sequence model Mamba has been demonstrated to be an efficient and effective tool for modeling long sequences, scaling linearly or near-linearly with sequence length.

Motivated by these studies, we propose MM-UNet, a U-shaped encoder-decoder network for prostate segmentation in MRI based on Mamba and U-Net. Its three main contributions are three novel modules: AFFM, GCM, and MACM. The quantitative results (Table 1) and qualitative results (Figs. 5 and 6) on two public prostate MR image segmentation datasets clearly illustrate MM-UNet's efficacy and robustness relative to other state-of-the-art methods.

Although our model achieves impressive results, it still has some limitations. First, because manual annotation is costly and time-consuming, the amount of available MR image training data is limited25, which caps the model's peak performance. Second, our model is specifically designed for prostate MR image segmentation, and its performance on segmentation tasks involving other modalities and organs remains uncertain. To improve model robustness and mitigate overfitting, our future research will focus on leveraging large-scale unlabeled medical images through self-supervised learning. We will also compare more segmentation backbones and explore additional medical image segmentation tasks across modalities and targets.