Introduction

The emergence of deep learning is gradually replacing traditional machine learning models as well as optimization-based algorithms, such as active contours and level set methods, that have been widely applied in MRI image segmentation. In cardiac MRI segmentation, deep learning models can automatically delineate the organs and regions of interest, such as the left ventricle, right ventricle, myocardium, and myocardial infarction areas [1,2,3]. Segmentation in cardiac MRI is a prerequisite for the further analysis and diagnosis of heart failure and for many cardiac applications, including coronary artery calcium scoring, plaque analysis, left ventricular analysis, myocardial infarction diagnosis, coronary and arterial disease prognosis, cardiac function evaluation, and the diagnosis and prognosis of heart disease.

With its ability to learn features automatically, deep learning allows models to learn complex features from data without the need to explicitly define and extract specific features beforehand. This eliminates or reduces the dependence on human intervention in feature design and helps the model automatically discover complex patterns and rules in images, thus bringing automatic segmentation performance close to manual segmentation [4,5,6,7]. Deep learning can learn from millions or even billions of cardiac MRI images, enhancing prediction and classification capabilities through models such as convolutional neural networks (CNNs) or autoencoders. However, early-stage deep learning models required large amounts of data and high computational costs while achieving only average performance. Patch-based CNN methods divide the input image into overlapping patches and process each patch independently. While this approach captures local features and gathers spatial information, it has a major drawback: redundancy in the inference process. The fully convolutional network (FCN), introduced in the pioneering work of Long et al. [8], addresses some of these issues. The FCN improves upon patch-based CNNs by processing input images of arbitrary size through an encoder-decoder structure, in which upsampling is realized by transposed convolutional kernels. Tran [9] applied the FCN to the segmentation of the left and right ventricles. However, the FCN has shown limitations in capturing the detailed contextual information required for accurate segmentation. To achieve more accurate segmentation, Ronneberger et al. [10] proposed the U-Net model, a well-known variant of the FCN. U-Net utilizes skip connections to avoid the loss of contextual information that the FCN may suffer from. Attention blocks have gradually replaced plain skip connections, as they enhance segmentation by focusing on important regions of the image. For example, Attention U-Net, proposed by Oktay et al. [11], builds upon the U-Net model by incorporating an attention gate mechanism that uses coarse-scale features to gate the skip connections, suppressing irrelevant responses and noise. Additionally, the self-attention mechanism combined with positional encoding, which captures the relative positions of elements in the input sequence, gives rise to the transformer architecture, which requires neither convolutional layers nor skip connections.

Though transformers have shown superior performance in computer vision tasks, their direct application to medical image segmentation still suffers from some shortcomings. Chen et al. [12] argued that transformers process the input as 1D sequences and focus on modeling global context at all stages, resulting in low-resolution features that lack detailed localization information. This information cannot be effectively recovered by direct upsampling to full resolution, leading to coarse segmentation results. On the other hand, CNN architectures (e.g., U-Net) extract low-level visual cues that can effectively handle such fine spatial details. Therefore, Chen et al. proposed the TransUnet model, which combines U-Net and the transformer to leverage the benefits of both architectures. Recognizing the strong dominance of CNNs in medical image segmentation, Cao et al. [13] instead proposed a model built on a pure Swin-Transformer architecture, inspired by the U-Net-like encoder-decoder framework. In this approach, before entering the Swin-Transformer, the input images are divided into non-overlapping patches. After encoding, the patches are decoded through a combination of patch-merging layers and Swin-Transformer blocks. In practice, we have found that training transformer models such as TransUnet, Swin-Unet, and MISSFormer [14] incurs significant computational costs. These models rely on large pretrained weights, which can pose challenges when updating and extending them: if the training data change or new task requirements arise, updating and extending pretrained models may require retraining from scratch or incur substantial time and resource expenses. When learning from small amounts of data, using transformer models with large memory footprints becomes unnecessary and overly expensive; they may not provide benefits that justify the required resources.

Motivated by the above concerns regarding image segmentation architectures for cardiac MRI, in the current study we propose a new model along with a novel loss for training the neural network. In particular, the proposed model, named CapNet, harmonizes attention blocks that process local information in clusters while emphasizing global information feedback. Furthermore, the model minimizes computational costs and model parameters while maintaining a balance in learning from the data.

Related Work

Deep learning has emerged as the primary trend in addressing healthcare automation problems in recent years. With the strong development of deep learning over the past decade, methods utilizing deep learning for cardiac MRI image segmentation have diversified and transformed significantly. There are two main approaches to this problem using deep learning. The first approach feeds the entire 3D volume of cardiac MR images into a deep learning model [15,16,17], which can be challenging due to the large volume and computational time required. Therefore, we adopt the second approach in this study: using 2D slices of the 3D volume in the deep learning model.

There are many previous studies following this approach, which we categorize into two main types: CNN-based and transformer-based. The CNN-based approach is the most common direction, with numerous studies adopting it, such as [18, 19], and [20]. Cui et al. [18] utilized Attention U-Net along with a pyramid of input images to retain maximum spatial information. Chen et al. [19] employed U-Net with dropout normalization layers after concatenation to reduce noise in the U-Net decoder. Wang et al. [20] used U-Net with multi-layer skip connections to connect low-level and high-level features. Overall, CNN-based networks are proficient at extracting both local and global information by employing convolutional operations with a strong inductive bias, allowing them to acquire robust representations. However, stacking many convolutional operations can result in inadequate handling of long-range dependencies and loss of spatial information. This is problematic because segmentation tasks typically require substantial spatial information. The other approach is transformer-based; although transformers were first introduced to vision in 2020 [21], several studies have already applied transformer-based methods to cardiac segmentation [14, 22]. Huang et al. [14] introduced a fully transformer architecture that supports local feature context. Li et al. [22] proposed a transformer architecture parameterized in a low-complexity form using Axial Attention [23]. In practice, transformer-based methods may yield suboptimal results when trained on insufficiently large datasets, particularly medical datasets, which pose additional challenges. In scenarios with limited data, transformer-based models often rely on pretrained weights to achieve the desired outcomes due to their lack of inductive bias. Furthermore, the high parameter count and complexity of transformer-based models can pose challenges during deployment. Consequently, hybrid architectures that integrate CNN-based and transformer-based approaches, such as [12, 24, 25], are garnering increasing attention from researchers; these architectures leverage the strengths of both methods to address their respective limitations.

The methods discussed have demonstrated strong effectiveness in cardiac MRI image segmentation, showcasing the efficacy of dividing 3D volumes into 2D slices. However, these methods, whether CNN-based or transformer-based, entail a large number of parameters yet have been applied to small datasets. It would be more effective to have a model whose number of parameters matches the size of the data. Lightweight models have been developed to address these challenges [26, 27]. To our knowledge, few lightweight models are currently designed specifically for cardiac MRI image segmentation, and a model that strikes a balance between parameters and performance for this task would be very promising.

Materials

Depthwise Separable Convolution

Depthwise separable convolutions were proposed by Chollet in the Xception model [28] for image classification tasks; they reduce the computational cost of convolution while maintaining good performance. While a normal convolution uses a single filter spanning multiple channels, a depthwise separable convolution splits the computation into two steps: depthwise convolution (DW) and pointwise convolution (PW). Depthwise convolution applies a separate convolutional filter to each input channel, allowing the model to capture channel-specific information. Pointwise convolution [29] utilizes a \(1 \times 1\) kernel applied individually to each pixel; intuitively, the small kernel size allows it to capture fine-grained details while combining information across channels. In other words, in a depthwise separable convolution, the depthwise step applies a separate kernel to each input channel, and the pointwise step then combines the resulting feature maps using a \(1 \times 1\) convolution.
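As an illustration, below is a minimal PyTorch sketch of a depthwise separable convolution. The class and variable names are ours for illustration and do not come from the CapNet implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one kernel per channel) followed by a
    1x1 pointwise convolution that mixes information across channels."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # groups=in_ch makes each filter see exactly one input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 16, 64, 64)
y = DepthwiseSeparableConv(16, 32)(x)  # shape: (1, 32, 64, 64)
```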

Priority Attention

Attention mechanisms in deep neural networks help the network focus on important information within domains such as channel and space. Recently, inspired by the greedy algorithm [30], Le et al. [31] proposed a new attention mechanism called priority attention, comprising two variants: Priority Channel Attention (PCA) and Priority Spatial Attention (PSA). Both PCA and PSA base their attention on how the feature maps change after convolution operations. PCA uses a depthwise convolution to select features for each channel; a channel-wise feature vector then identifies the channels that change the most and is passed through a softmax to produce the attention vector. With this design, the output features are filtered by channel without requiring additional parameters. Similarly, PSA produces an attention matrix based on the per-pixel deviation of the features output by a pointwise convolution. PSA carries spatial feature information and selects features to produce an effective feature set. Both PCA and PSA were first used in classification problems. In segmentation, however, the features obtained through attention are very important and can increase efficiency. Therefore, applying PCA and PSA to segmentation can improve performance without increasing the number of model parameters. A sketch of this idea appears below.
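The following is a hedged sketch of the PCA idea as we read it from [31]: channels whose average response changes most after a depthwise convolution receive the largest attention weights, and the attention computation itself adds no learnable parameters beyond the depthwise convolution already present. The exact ordering of operations in the original module may differ.

```python
import torch
import torch.nn as nn

class PriorityChannelAttention(nn.Module):
    """Sketch of PCA: a softmax over the per-channel change induced by a
    depthwise convolution yields a parameter-light channel attention."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=(2, 3))                  # per-channel mean before DW
        s_prime = self.dw(x).mean(dim=(2, 3))   # per-channel mean after DW
        attn = torch.softmax(s_prime - s, dim=1)        # prioritize change
        return x * attn.unsqueeze(-1).unsqueeze(-1)     # reweight channels
```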

Pooling Attention Based on MLP

During image dimension reduction, information loss may occur as a consequence of the pooling function. Applying an attention mechanism to the dimension-reduction process, however, can help retain essential information from the input. CPA-Unet [32] utilizes a pooling attention mechanism that is structurally similar to the SE (Squeeze-and-Excitation) and ECA (Efficient Channel Attention) techniques. The module is split into two branches, incorporating a combination of average pooling, max pooling, and MLP (multilayer perceptron) layers. This approach retains essential information without significantly increasing the number of parameters. Therefore, incorporating PA (Pooling Attention) blocks into the encoder is appropriate for improving the model's output; a sketch follows.
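As a rough illustration of this design, the sketch below fuses average- and max-pooled channel descriptors through a shared MLP before downsampling; the reduction ratio and fusion details are our assumptions, not the exact CPA-Unet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingAttention(nn.Module):
    """Two-branch pooling attention: avg- and max-pooled descriptors pass
    through a shared MLP; their sigmoid-fused sum reweights the channels
    before the encoder's spatial downsampling."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).view(b, c))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(b, c))
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return F.max_pool2d(x * w, kernel_size=2)  # attend, then downsample
```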

Methodology

The Proposed Model

Our Proposed CapNet Model

Following the surveys on cardiac segmentation [1, 2], we propose a new model for cardiac MRI image segmentation. The proposed model has the symmetric encoder-decoder architecture of the U-Net [10]. Our proposed architecture, CapNet, is shown in Fig. 1. In the CapNet encoder, we construct a block consisting of a Conv-block and Pooling Attention (PA). The input of shape (B, C, H, W), with C = 1 for the cardiac segmentation data, is first passed through the Conv-block, which applies a convolution with a kernel size of 3 to extract local features, followed by batch normalization (BatchNorm), ReLU activation, and another convolution to further learn the selected features; a sketch of this block is given below.
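A minimal sketch of the Conv-block as described above; the exact layer ordering in the released implementation may differ.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv-block sketch: 3x3 convolution, BatchNorm, ReLU, then a second
    convolution to further refine the selected features."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.block(x)
```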

Fig. 1 Our proposed CapNet model

After the Conv-block, the features are fed into the MLP-based Pooling Attention. Through this attention mechanism, we aim to extract information from the feature maps while expanding the receptive field, reducing the information lost when the dimension is reduced through max pooling. Four Conv-blocks with MLP-based Pooling Attention are established, with the spatial dimensions decreasing by factors of [\(\frac{1}{2},\frac{1}{4},\frac{1}{8},\frac{1}{16}\)] while the filter sizes of the encoding blocks increase to [16, 32, 64, 128], respectively. At the bottom of the architecture, the bottleneck, features are not transferred directly from the encoder to the decoder as in U-Net. Instead, they traverse a bridge similar to the PASPP module in [33] or the ConvMixer [34] bottleneck in [35]. In this study, we propose a new module for the bottleneck, named the Priority Mixer Block, which shows superior performance compared to the commonly used ASPP and PASPP bottlenecks and the ConvMixer module. Details of the proposed Priority Mixer Block are described in the next subsection.

For the decoder, we gradually apply upsampling blocks to restore the feature maps to their original resolution and generate the desired predictions. The blocks following the bridge consist of a Conv-block and the proposed Depthwise-Focus (DWF) Block based on Wide-Focus [36]. Between these blocks, upsampling layers restore the spatial dimensions, and the filter counts decrease inversely to the encoder. The skip connections from the encoder are concatenated with the corresponding blocks in the decoder.

Our Proposed Priority Mixer Block

In the current study, inspired by the PCA and PSA modules in [31], which have shown effectiveness for fish classification, we adapt these architectures to cardiac MRI segmentation. To this end, we replace the depthwise separable convolution with the Priority Channel Attention block, where the attention block has an identity branch to avoid the vanishing problem. The PCA block selectively emphasizes informative channels within the feature maps, allowing important patterns to be represented. Then, instead of the pointwise convolution in the ConvMixer, we use PSA\(++\), our proposed upgrade of PSA that extends the MLP's receptive field across height and width. The proposed Priority Mixer block is detailed in Fig. 2.

Similar to PSA, the proposed PSA\(++\) extends the attention mechanism beyond channels to the spatial dimensions H and W. Instead of focusing solely on channel-wise relationships, we also consider the relationships between pixels along the height and width dimensions; incorporating spatial attention lets us better capture spatial dependencies and improve the representational power of the model. Beginning with a feature \(x^{(B, C, H, W)}\), we branch into three pathways, each processing pixel information from the perspective of channel, height, or width. Treating each dimension separately allows us to capture different aspects of the input feature and extract the relevant information for each dimension. Reshape operations handle the transition from channel to height or width; for example, in the height pathway, we reshape the input feature (B, C, H, W) to (B, H, C, W), effectively treating each position along the height dimension as a separate channel. The subsequent steps follow the same methodology as PSA. In each pathway, we compute the average across the leading (channel-like) dimension within the reshaped blocks, obtaining \(Sc\), \(Sh\), and \(Sw\); this averaging summarizes the overall characteristics of each dimension across channels, reduces dimensionality, and focuses on the most important aspects within each pathway. All three pathways then undergo a pointwise convolution (PW), analogous to the step performed in PCA but with pointwise rather than depthwise convolution (DW); this operation further processes the features and combines information across dimensions and channels, leading to a more comprehensive understanding of the data. Finally, we again average across the leading dimension of each pathway, resulting in \(S'c^{(B, H, W)}\), \(S'h^{(B, C, W)}\), and \(S'w^{(B, H, C)}\).

Fig. 2 Our Priority Mixer Block

These output features represent the enhanced spatial attention within each pathway; averaging across channels yields a summary of the attention weights for each dimension. To further refine the spatial attention, we probabilistically normalize the attention weights within each pathway using the difference tensors \((S'c - Sc)\), \((S'h - Sh)\), and \((S'w - Sw)\), which capture the changes in attention after the spatial processing. By subtracting the original attention weights, we focus on the changes and identify the areas that have received more or less attention. We then apply the softmax function to these tensors, which scales the values to the range [0, 1] and ensures that they sum to 1. This normalization allows us to interpret the values as probabilities and obtain a distribution of attention weights for each dimension, which can then weight the features or guide subsequent computations in the network. Additionally, to maintain stability and avoid excessive fluctuations during training, the spatial attention coefficients are computed using the following formulas:

$$\begin{aligned} F'_{sc} = \sigma [S'c \times (1 + softmax2d(S'c - Sc))] \end{aligned}$$
(1)
$$\begin{aligned} F'_{sh} = \sigma [S'h \times (1 + softmax2d(S'h - Sh))] \end{aligned}$$
(2)
$$\begin{aligned} F'_{sw} = \sigma [S'w \times (1 + softmax2d(S'w - Sw))] \end{aligned}$$
(3)
$$\begin{aligned} F'_s = F'_{sc} + F'_{sh} + F'_{sw} \end{aligned}$$
(4)
$$\begin{aligned} x = x \cdot F'_s \end{aligned}$$
(5)

Our Proposed Depthwise-Focus (DWF) Block

From experiments and observations, we found that in the encoder-decoder architecture, the decoder achieves the best performance when it simultaneously decodes local information and global context and then recovers details from the spatial source and the previously encoded feature maps. Thus, in this work we propose the Depthwise-Focus Block, shown in Fig. 3, which takes these findings into consideration. In particular, we connect a depthwise separable convolution layer right after the Conv-block to generate a convolutional filter for each input channel, allowing global information to be decoded at the output of each channel.

To enhance the decoding process and focus on the desired factors, we use additional depthwise convolutions with kernel sizes of \(1 \times k\) and \(k \times 1\). Compared to a standard \(k \times k\) convolution, this requires only 2k parameters per channel instead of \(k^{2}\); since \(2k < k^{2}\) for \(k \ge 3\), the computational cost is reduced significantly as k increases. In our experiments, we found that \(k = 7\) yields the best performance. We also experimented with linearly increasing dilation rates, where three such convolutions are added together in parallel. Combining different dilation levels emphasizes flexibility in local detail depending on the linear dilation rate and avoids the loss of accuracy that occurs after the network's learning saturates.

Moreover, this direct emphasis helps stabilize the architecture by accurately learning the focused pixels from the encoder blocks and the Priority Mixer bridge. Additionally, adding the depthwise separable convolution blocks in parallel with the standard convolution, as mentioned above, improves the network's ability to reproduce global maps in deeper layers. We build on the Wide-Focus module introduced in [36], but instead of standard convolutions with different dilations, we use depthwise separable convolutions, following Fig. 3, with the parallel dilation order 1, 2, 3 giving the best results. To achieve optimal performance and avoid information loss, we also add a Residual Block in parallel, combined by element-wise addition. This block mitigates the vanishing-gradient problem arising from the direct connection from the encoder, which is connected to the decoding block with the same filter count. We observed a significant improvement in results by integrating the proposed block into the architecture. A sketch of the block follows.
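The sketch below illustrates the axial depthwise branch structure with parallel dilations 1, 2, 3 and a residual path; the placement of the preceding depthwise separable convolution and the activation functions are omitted and should be taken as our simplified reading of the block.

```python
import torch
import torch.nn as nn

class AxialDepthwise(nn.Module):
    """Axial depthwise pair: a 1 x k then a k x 1 depthwise convolution,
    costing roughly 2k weights per channel instead of k^2."""
    def __init__(self, ch: int, k: int = 7, dilation: int = 1):
        super().__init__()
        pad = dilation * (k // 2)  # keeps the spatial size unchanged
        self.h = nn.Conv2d(ch, ch, (1, k), padding=(0, pad),
                           dilation=(1, dilation), groups=ch)
        self.v = nn.Conv2d(ch, ch, (k, 1), padding=(pad, 0),
                           dilation=(dilation, 1), groups=ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.v(self.h(x))

class DepthwiseFocus(nn.Module):
    """Sketch of the DWF idea: three parallel axial depthwise branches
    with dilations 1, 2, 3 plus a residual path, fused by addition."""
    def __init__(self, ch: int, k: int = 7):
        super().__init__()
        self.branches = nn.ModuleList(
            AxialDepthwise(ch, k, d) for d in (1, 2, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + sum(b(x) for b in self.branches)  # residual + dilated sum
```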

Fig. 3 Our DepthWise-Focus (DWF) Block

The Proposed Loss Function

The Tversky Shape Power Distance (TSPD) Loss Function

Along with the advancements in deep learning models, loss functions have gradually evolved to shorten convergence time and capture the regions where the model performs best [37,38,39,40]. In this study, inspired by the shape distance described by Pham et al. in [41], we propose a modified loss for training the network. Our proposed shape distance term measures the dissimilarity between the predicted mask \(\hat{y}\), with \(\hat{y} \in [0,1]\), and the ground truth y, with \(y \in \{0, 1\}\). Let N denote the number of pixels in the maps. We raise the predicted mask in the shape distance to a power of m. The modified shape distance is written as follows:

$$\begin{aligned} L_{d}(y,\hat{y}) = \frac{1}{N}\sum _{i=1}^{N} (y_{i}(1-\hat{y_{i}}^{{m}}) + \hat{y_{i}}^{{m}}(1-y_{i})) \end{aligned}$$
(6)

Instead of directly applying the weight \(\frac{1}{N}\) to \(L_{d}\) as described above, we replace it with a scaling by the denominator of the Tversky loss function [38]. Accordingly, we propose a loss function of the following form:

$$\begin{aligned} L_{d}(y,\hat{y}) = \frac{\sum _{i=1}^{N} (y_{i}(1-\hat{y_{i}}^{{m}}) + \hat{y_{i}}^{{m}}(1-y_{i}))}{\sum _{i=1}^{N}(y_{i}\hat{y_{i}}+\alpha (1-y_{i})\hat{y_{i}}+\beta (1-\hat{y_{i}})y_{i} )} \end{aligned}$$
(7)

where \(\alpha\) and \(\beta\) are the hyperparameters of the Tversky loss [38]. The true positive (TP) term is \(y_{i}\hat{y_{i}}\), the false positive (FP) term is \((1-y_{i})\hat{y_{i}}\), and the false negative (FN) term is \((1-\hat{y_{i}})y_{i}\). In our simplified loss function, when \(m = 1\) and \(\alpha = \beta = 1\), Eq. 7 becomes the Jaccard/IoU loss. The choice of \(\alpha\) or \(\beta\) depends on whether the false positive rate or the false negative rate should be adjusted to match the characteristics of the dataset. Based on experiments, we constrain the two parameters to \(\alpha + \beta = 1\). We experimented with a ratio of \(\alpha : \beta = 3:7\), which yielded good results. To find the optimal range of values for m, we consider the case \(\alpha : \beta = 3:7\) with ground truth \(y = 1\); for simplicity, \(\hat{y}\) is gradually increased within the range [0,1]. The resulting graph is shown in Fig. 4.

Fig. 4 Influence of the parameter m in the Tversky Shape Power Distance loss

As shown in Fig. 4, when m < 1 the function focuses on accurately predicting low-density pixels that are misclassified. Testing with m \(\ge 1\) yields more stable slopes and better performance. To select an appropriate value for m, we traced the varying values in the graph of Fig. 4: when m lies in the range [\(\frac{4}{3}\), 3], we obtained better results, and m \(= 2\) is the best value, which we used throughout the training process. It is worth noting that \(\hat{y}\) in Eq. 6 is close to the degree of membership function in fuzzy active contour models [42]. In this formulation, the power m plays the role of a weighting coefficient on the fuzzy membership and is commonly set to 2 in the fuzzy logic field. A sketch of this loss follows.
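Below is a minimal sketch of the binary TSPD loss of Eq. 7 with the chosen hyperparameters \(\alpha = 0.3\), \(\beta = 0.7\), and \(m = 2\); the small epsilon is our addition for numerical stability.

```python
import torch

def tspd_loss(y_true: torch.Tensor, y_pred: torch.Tensor,
              alpha: float = 0.3, beta: float = 0.7,
              m: float = 2.0, eps: float = 1e-7) -> torch.Tensor:
    """Binary TSPD loss (Eq. 7): y_pred holds probabilities in [0, 1],
    y_true binary masks; both are flattened over all pixels."""
    y = y_true.reshape(-1).float()
    p = y_pred.reshape(-1)
    p_m = p ** m
    numerator = (y * (1 - p_m) + p_m * (1 - y)).sum()       # shape distance
    tp = (y * p).sum()
    fp = ((1 - y) * p).sum()
    fn = ((1 - p) * y).sum()
    return numerator / (tp + alpha * fp + beta * fn + eps)  # Tversky denom.
```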

Formulation with Tversky Shape Power Distance (TSPD) Loss Function

Based on the extended multiclass level-set description proposed by Trinh et al. in [35] and Kim and Ye in [43], we replace the ground truth y with \(\textbf{Y}\), the input one-hot vector. \(\textbf{Y}\) is composed of multiple channels, where each channel contains a binary segmentation mask; these masks determine the spatial domain of class k within the set \(\{1, 2, \ldots, N\}\). Each channel in \(\textbf{Y}\) represents a specific class and distinguishes the regions assigned to that class with binary values. \(\textbf{P}\) denotes the softmax output of the network, \(P(\phi )\). The formulation with the Tversky Shape Power Distance (TSPD) loss function is as follows:

$$\begin{aligned} L_{d}(\textbf{Y},\textbf{P}) = \frac{\sum _{k=1}^{N}\sum _{x\in \phi } (\mathbf {Y_{kx}}(1-\mathbf {P(\phi )_{kx}}^{{m}}) + \mathbf {P(\phi )_{kx}}^{{m}}(1-\mathbf {Y_{kx}}))}{\sum _{k=1}^{N}\sum _{x\in \phi }(\mathbf {Y_{kx}}\mathbf {P(\phi )_{kx}}+\alpha (1-\mathbf {Y_{kx}})\mathbf {P(\phi )_{kx}}+\beta (1-\mathbf {P(\phi )_{kx}})\mathbf {Y_{kx}} )} \end{aligned}$$
(8)

The proposed loss function gradually approaches 0 as the output \(P(\phi )\) of the architecture approaches the closest match to \(\textbf{Y}\). If the predicted output deviates significantly, the power term m that we incorporate amplifies the penalty, thereby driving the network to increase the number of correctly classified pixels with respect to the ground truth. A sketch of this multiclass form is given below.
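Because the double sums over classes and pixels appear in both the numerator and denominator of Eq. 8, a sketch can simply flatten the one-hot tensor \(\textbf{Y}\) and the softmax output \(\textbf{P}\); again, the epsilon is our addition for stability.

```python
import torch

def tspd_loss_multiclass(Y: torch.Tensor, P: torch.Tensor,
                         alpha: float = 0.3, beta: float = 0.7,
                         m: float = 2.0, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of Eq. 8: Y is a one-hot tensor (B, K, H, W) and P its
    softmax prediction of the same shape."""
    y, p = Y.float().flatten(), P.flatten()
    num = (y * (1 - p ** m) + p ** m * (1 - y)).sum()
    den = (y * p + alpha * (1 - y) * p + beta * (1 - p) * y).sum()
    return num / (den + eps)
```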

Evaluation Metrics

In image segmentation, the two most commonly used evaluation metrics are the Dice similarity coefficient (DSC) and the intersection-over-union index (IoU), also known as the Jaccard index. The DSC statistically measures the similarity between the segmentation map and the ground truth, while the IoU gauges the similarity and diversity of the sample pixel sets; both can be computed as below.
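For reference, a simple sketch of both metrics for binary masks (the epsilon guarding empty masks is our addition):

```python
import torch

def dice_iou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """DSC = 2*|A & B| / (|A| + |B|); IoU = |A & B| / |A | B|."""
    pred, target = pred.bool(), target.bool()
    inter = (pred & target).sum().float()
    dsc = 2 * inter / (pred.sum() + target.sum() + eps)
    iou = inter / ((pred | target).sum() + eps)
    return dsc.item(), iou.item()
```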

The statistical significance of a segmentation model compared with other models is demonstrated by the p-value, with an assumed significance level of 0.05. Statistical significance was determined using the non-parametric Wilcoxon signed-rank test [44] for hypothesis testing. In particular, in the current study, the segmentation scores (DSC and IoU) of each compared model and the proposed model are evaluated by computing the p-value between the two models.
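In practice, the test can be run on the paired per-image scores with SciPy; the score lists below are hypothetical placeholders.

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-image DSC scores of two models
dsc_capnet   = [0.94, 0.92, 0.95, 0.91, 0.93, 0.96]
dsc_baseline = [0.91, 0.90, 0.93, 0.88, 0.92, 0.94]

stat, p_value = wilcoxon(dsc_capnet, dsc_baseline)
print(p_value < 0.05)  # significant at the 0.05 level used in this study
```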

Experiment

Datasets

The Sunnybrook Dataset

The Sunnybrook dataset [6], provided by the Sunnybrook Health Sciences Centre, Toronto, Canada, was proposed in the MICCAI 2009 LV segmentation challenge. The dataset includes cardiac cine-MRI images (1.6 GB) in DICOM format collected from 45 patients. The patients cover a diverse range of cardiac conditions: healthy hearts, hypertrophy, heart failure with infarction, and heart failure without infarction. The data also include manual segmentation contours of the endocardium and epicardium, drawn by Perry Radau of the Sunnybrook Health Sciences Centre, on slices from various phases including end diastole (ED) and end systole (ES). All images were obtained during 10-15 s breath-holds with a temporal resolution of 20 cardiac phases over the heart cycle, scanned from the ED phase. The endocardium and epicardium images are split into three parts with a ratio of 70:15:15 for training, validation, and testing, respectively. The data are resized to a resolution of \(256 \times 256\) pixels.

The MRI Cardiac ACDC Dataset

The ACDC dataset [45] was generated from real clinical exams conducted at the University Hospital of Dijon. To ensure privacy, all acquired data underwent a thorough anonymization process and were handled in compliance with the regulations established by the local ethics committee of the Hospital of Dijon, France. The ACDC dataset consists of 4D cine CMR scans from 100 patients. Each scan includes segmentation labels for the left ventricle (LV), the myocardium (Myo), and the right ventricle (RV) during the end-systolic and end-diastolic phases. The dataset was divided into training, validation, and testing sets with a split ratio of 70:10:20. All images were resized to \(128 \times 128\) pixels in this study.

The MS-CMRSeg 2019 Dataset

The MS-CMRSeg 2019 (or MS-CMR) dataset [46, 47] contains 45 multi-sequence CMRs provided by the organizers of the Multi-sequence Cardiac MR Segmentation Challenge. The dataset aims to capture specific aspects of cardiac imaging using different CMR sequences. Magnetic resonance imaging (MRI) is widely used to gather both anatomical and functional details of the heart. The T2-SPAIR CMR sequence is used to visualize acute injuries and ischemic regions; the bSSFP cine CMR sequence captures cardiac motion and establishes distinct boundaries; and the LGE CMR sequence is specifically designed for visualizing myocardial infarction. The T2-weighted, black-blood spectral presaturation attenuated inversion-recovery (SPAIR) sequence generally includes a limited number of slices: of the 45 cases in the dataset, 13 consist of only three slices, while the remaining cases contain five (13 subjects), six (8 subjects), or seven (one subject) slices. The bSSFP cine CMR sequence, in contrast, is a balanced steady-state free precession cine sequence that typically consists of 8 to 12 contiguous slices covering the ventricles entirely, from the apex to the basal plane of the mitral valve; some cases include additional slices beyond the ventricles. The images and masks of all sequences are resized to a resolution of \(256 \times 256\) pixels. The train-to-validation-to-test ratio for this dataset is 70:10:20.

Implementation Details

We trained the proposed network, CapNet, with our customized Tversky Shape Power Distance loss to segment the MRI images. Our model is trained on a workstation with an NVIDIA Tesla P100 16GB GPU. Minimization is performed over several epochs using the AdamW optimizer [48] with an initial learning rate of 1e-3. Every 5 epochs, the learning rate is halved until it reaches 1e-5, after which it is kept constant for the remainder of the 200-epoch training period used for all three datasets; a sketch of this schedule is given below. The datasets are also augmented by various techniques, such as rotation, flipping, and scaling, to further increase the diversity of the training data.
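A sketch of this optimizer and schedule (the placeholder model stands in for CapNet):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 16, 3)  # placeholder for the CapNet model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Halve the learning rate every 5 epochs until it reaches 1e-5,
# then keep it constant for the remainder of the 200 epochs
def lr_lambda(epoch: int) -> float:
    return max(0.5 ** (epoch // 5), 1e-5 / 1e-3)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(200):
    # ... one training epoch over the augmented data ...
    scheduler.step()
```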

Experimental Results

Model Visualization

Deep learning models deliver unprecedented breakthrough results in computer vision tasks. Although these models exhibit outstanding performance, their complexity makes them impossible to decompose into smaller parts for interpretation; when problems arise, we can only rely on guesswork because we cannot pinpoint the exact cause. Recently, interpreting deep learning models has become possible using gradient-based methods that trace from a target layer back to its component neurons. This approach provides a more intuitive understanding of how deep learning models function and helps identify important neurons in the network. In this study, we interpret our model by visualizing important layers using the Grad-CAM [49] method, sketched below.
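A minimal Grad-CAM sketch using forward/backward hooks is shown below; how the class score is reduced for a segmentation output (here, summing the logits of the target class) is our simplification.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx: int):
    """Weight the target layer's activations by the spatially averaged
    gradient of the class score, then ReLU and upsample to input size."""
    acts, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    model(x)[:, class_idx].sum().backward()  # sum of target-class logits
    h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3), keepdim=True)            # GAP of grads
    cam = F.relu((w * acts[0]).sum(dim=1, keepdim=True))   # weighted sum
    return F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                         align_corners=False)
```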

Fig. 5 Visualization of the sequential process through the proposed model to influence the segmentation output using Grad-CAM

In the illustration, Fig. 5 shows how the proposed CapNet model forms the output for all three categories, RV, Myo, and LV. The heatmap overlaid on the diagram depicts how strongly each component layer contributes to the target segmentation layer. It can be observed that the encoder block diversifies the feature maps, but the initial layers only extract raw information and have minimal impact on the target segmentation layer. Deeper layers tend to highlight more pronounced and diverse regions, extracting higher-level information and producing more diverse feature maps.

Fig. 6 Visualization of the sequential process through the proposed Priority Mixer Block to influence the segmentation output

Information at the end of the encoder block is passed through the proposed bottleneck block, the Priority Mixer. In this block, information is condensed and focused using the channel attention and spatial attention mechanisms. The feature map is then passed through the decoder, which is tasked with upsampling to generate the segmentation image. Noticeably, the information becomes noisy upon entering the decoder, attributable to the skip connections from the encoder and the additional convolutional layers within it. Over successive layers, however, adjustments gradually refine and synthesize the information for accurate segmentation.

The layers near the end concentrate precisely on the target segmentation layer. In these layers, the weights are updated through gradient descent closest to the loss function, so adjusting weights or synthesizing information from preceding layers is more effective there.

Fig. 7 The role of the Priority Mixer block in the entire model

To clarify the function of the Priority Mixer block, in Fig. 6 we use some samples to illustrate the internal process of this block. Before entering PCA, the feature maps are diverse and not focused on any specific region. After passing through the PCA block, whose sole task is to synthesize important features per channel, the feature maps are partially condensed but not yet clearly defined. After PSA++, the features are further contracted, focusing on a visible region. Additionally, we provide another example of the importance of the Priority Mixer in Fig. 7: it helps the feature maps focus more on the target segmentation class.

Fig. 8 Visualization of the sequential process through the proposed Depthwise-Focus Block

In Fig. 8, we visualize the importance of the first Depthwise-Focus Block in our proposed model. The Depthwise-Focus Block consists of three branches: a depthwise convolution block, a wide block, and a residual block. The wide block applies depthwise convolutions both vertically and horizontally while employing different expansion factors to synthesize diverse essential information. Meanwhile, the depthwise convolution branch emphasizes local features, and the residual block helps retain certain characteristics. In the sample shown, the wide block has effectively fulfilled its role. The remaining branches not only preserve information but also diversify the feature maps. Although they may introduce some noise and might not be optimal for the final segmentation purpose, their combined use across multiple layers of the model provides better directionality and effectively exploits their function.

Evaluation on the Sunnybrook Dataset

We first compare the performance of the proposed model against different models on the Sunnybrook dataset. The results in Fig. 9 show that the segmentation by the proposed model agrees with the ground truth better than those of the other models. For quantitative assessment, we provide the DSC and IoU scores of the comparative models on the Sunnybrook data in Table 1, together with the statistical significance analysis by p-value. The tests check whether there is a difference between the segmentation quality measures of each compared model and our proposed model.

Table 1 The quantitative comparison between the proposed CapNet and SOTA on the Sunnybrook data. DSC and IoU scores are in mean (standard deviation)
Fig. 9 Segmentation results of the top 5 models on the Sunnybrook dataset

Table 1 shows that the proposed CapNet exhibits a statistically significant improvement in both DSC and IoU compared to TransUNet (p = 0.0336 for DSC, p = 0.0261 for IoU), Swin-Unet (p = \(8.727 \times 10^{-4}\) for DSC, p = \(3.503 \times 10^{-4}\) for IoU), Res-Unet (p = \(3.710 \times 10^{-4}\) for DSC, p = \(1.019 \times 10^{-4}\) for IoU), DS-TransUnet (p = \(1.031 \times 10^{-3}\) for DSC, p = \(4.612 \times 10^{-4}\) for IoU), U-Net (p = \(5.776 \times 10^{-4}\) for DSC, p = \(2.296 \times 10^{-4}\) for IoU), Attention-Unet (p = 0.0291 for DSC, p = 0.0187 for IoU), and SegNet (p = 0.0125 for DSC, p = \(9.536 \times 10^{-3}\) for IoU) for the endocardium. For the epicardium, the statistical values are significant compared to most methods, except MSU-Net, U-Net, and U-Net++. Nevertheless, considering the number of training parameters, shown in the second column of Table 1, the proposed CapNet has significantly fewer parameters than these models.

For better visualization, boxplots of the IoU and DSC scores of these models are given in Fig. 10. As can be seen from this figure, with the smallest number of parameters, the proposed model attains the highest median and maximum values of both IoU and DSC.

Fig. 10 Boxplots of IoU and DSC scores on the Sunnybrook dataset of different models for the endocardium (top) and epicardium (bottom)

Evaluation on the MRI Cardiac ACDC Dataset

In the first experiment on the ACDC data, we show the performance of left ventricle segmentation, including the endocardium and epicardium, by the proposed model and the compared models in Fig. 11. From the representative segmentations in this figure, we observe that the masks predicted by our model agree best with the ground truths for both epicardial and endocardial regions in short-axis images, including apex, mid, and base slices.

Fig. 11 Representative segmentation results of the top 5 models on the left ventricle of the ACDC dataset

The quantitative results for the left ventricle are given in Table 2. The scores in this table show that for the endocardium, the proposed CapNet gives better DSC and IoU values, 94.49% and 90.15% respectively, than almost all of the compared models. Specifically, U-Net++ gives a DSC of 92.65% (p = \(1.942 \times 10^{-3}\)) and IoU of 86.78% (p = \(7.152 \times 10^{-5}\)); SegNet achieves a DSC of 91.76% (p = \(7.777 \times 10^{-10}\)) and IoU of 81.47% (p = \(8.933 \times 10^{-13}\)); Res-Unet scores 90.97% DSC (p = \(1.910 \times 10^{-5}\)) and 85.01% IoU (p = \(1.489 \times 10^{-6}\)); DS-TransUnet records a DSC of 90.44% (p = \(3.008 \times 10^{-8}\)) and IoU of 85.03% (p = \(3.519 \times 10^{-13}\)); and Swin-Unet attains a DSC of 90.02% (p = \(9.613 \times 10^{-10}\)) and IoU of 82.96% (p = \(1.132 \times 10^{-14}\)). For the epicardium, the p-values computed for all models are smaller than 0.05, showing significant improvement of the proposed approach over the other models. CapNet achieves superior DSC and IoU values of 96.82% and 93.93%, respectively, highlighting its robust performance. These results emphasize CapNet's remarkable improvement over other state-of-the-art models.

For better quantitative assessment, we provide boxplots of the compared models in terms of DSC and IoU in Fig. 12. As can easily be observed from these figures, our proposed approach gives the highest median and maximum scores for both the DSC and IoU indices compared to the comparative models.

Fig. 12 Boxplots of IoU and DSC scores of the endocardium (top) and epicardium (bottom) on the left ventricular ACDC dataset of different models

Table 2 The quantitative comparison between the proposed CapNet and SOTA on the left ventricle of ACDC dataset. DSC and IoU scores are in mean (standard deviation)

In addition, we show the performance of multiclass segmentation on the ACDC dataset in Fig. 13. The segmented regions include the right ventricle (RV), myocardium (MYO), and left ventricle (LV). As can be seen from this figure, the segmentation by the proposed CapNet is close to the ground truths, while under-segmentation occurs in the U-Net results.

Fig. 13 Representative segmentation results of the top 5 models for the right ventricle, myocardium, and left ventricle on the ACDC dataset

Table 3 The quantitative comparison between the proposed CapNet and SOTA on multiclass of ACDC dataset. DSC score is in mean (standard deviation)

For multiclass segmentation, to better assess the results quantitatively, we show boxplots of the compared models in terms of DSC in Fig. 14. As can be seen from these figures, compared to the other models, our proposed approach gives the highest median and maximum DSC scores for all regions, including the RV, Myo, and LV areas, as well as the AVG values.

Fig. 14 Boxplots of DSC scores of different models for multiclass segmentation on the ACDC dataset

Table 4 Comparison between the proposed CapNet model and SOTA on the MS-CMR dataset with different sequences. (a) The bSSFP cine CMR sequence images. (b) The T2-SPAIR CMR sequence images. (c) The LGE CMR sequence. DSC scores are in mean (standard deviation)

To compare the evaluation scores, we provide the DSC scores of the comparative models on the ACDC data in Table 3 for the segmented regions: the right ventricle (DiceRV), myocardium (DiceMYO), and left ventricle (DiceLV), along with the average value over the three regions (DiceAvg).

The quantitative comparison of the proposed CapNet with state-of-the-art models on the ACDC dataset is presented in Table 3. For the right ventricle (RV), CapNet achieves a DSC of 92.34%, significantly outperforming SegNet (p=\(1.235 \times 10^{-3}\)), Res-Unet (p=\(4.116 \times 10^{-3}\)), DS-TransUnet (p=0.0125), TransUNet (p=\(1.361 \times 10^{-6}\)), and Swin-Unet (p=\(1.784 \times 10^{-4}\)). In the myocardium (Myo), CapNet’s DSC of 90.95% is notably better than U-Net++ (p=0.0450), SegNet (p=\(3.984 \times 10^{-5}\)), Res-Unet (p=\(8.653 \times 10^{-3}\)), DS-TransUnet (p=\(4.294 \times 10^{-4}\)), TransUNet (p=\(1.485 \times 10^{-9}\)), and Swin-Unet (p=\(2.353 \times 10^{-8}\)). For the left ventricle (LV), CapNet achieves a DSC of 95.86%, surpassing SegNet (p=\(9.66 \times 10^{-3}\)), Res-Unet (p=0.0276), DS-TransUnet (p=0.0195), TransUNet (p=\(2.975 \times 10^{-4}\)), and Swin-Unet (p=\(1.951 \times 10^{-5}\)). Overall, CapNet achieves an average DSC of 93.05%, demonstrating significant improvements over SegNet (p=\(1.706 \times 10^{-5}\)), Res-Unet (p=\(2.673 \times 10^{-4}\)), DS-TransUnet (p=\(9.892 \times 10^{-5}\)), TransUNet (p=\(6.289 \times 10^{-10}\)), and Swin-Unet (p=\(2.373 \times 10^{-11}\)). These results clearly illustrate the remarkable performance of CapNet across all evaluated metrics. It is worth mentioning that, compared to those models, our model has the smallest number of parameters as shown in the second column of Table 3.

Evaluation on the MS-CMR Dataset

We conducted experimental studies on the MS-CMR 2019 dataset to investigate segmentation. We present the segmentation performance on the three subsets of CMR sequence images (bSSFP cine, T2-SPAIR, LGE), evaluated using the Dice coefficients for the right ventricle (DiceRV), myocardium (DiceMYO), and left ventricle (DiceLV), together with the average Dice coefficient (DiceAvg) over the three regions. These evaluations were performed using the proposed model and compared to other models, as shown in Fig. 15.

In Fig. 15, we present the top 5 models with the best mean DSC results. As shown in this figure, the segmentation results of the proposed model are closest to the ground truth on all three image sets: bSSFP, T2, and LGE. In panel a of this figure, the small slice masks (bottom) make it difficult to capture the details of Myo and RV, resulting in significant discrepancies among the compared models. In panel b, the slice masks (middle) exhibit over-segmentation in the compared models. Finally, in panel c, for the LGE cine sequence image masks, discrepancies with the ground truth occur for small-sized masks in models such as U-Net and Attention-Unet.

Fig. 15 Representative segmentation results of the right ventricle, myocardium, and left ventricle on the MS-CMR 2019 dataset with different CMR sequence images: a LGE CMR, b bSSFP, and c T2 CMR

Fig. 16 Boxplots of the Dice similarity coefficient on the LGE CMR sequence images in the MS-CMR 2019 dataset for multiclass segmentation

The quantitative results for the MS-CMR 2019 dataset are provided in Table 4. Across the entire MS-CMR 2019 dataset, based on the computed p-values, the proposed model outperforms most other models, except U-Net++, Attention-Unet, MSU-Net, and nnUnet (for the T2-SPAIR sequences), in the majority of regions (RV, Myo, LV).

Our CapNet model demonstrates outstanding performance across the different CMR sequence images when compared to several state-of-the-art models, particularly those with statistically significant p-values (less than 0.05). On the bSSFP cine CMR sequence images, CapNet achieves DSC scores of 94.65% for RV, 92.05% for Myo, and 97.06% for LV, with an average DSC of 94.59%, significantly outperforming SegNet, Res-Unet, DS-TransUnet, TransUNet, and nnUNet in DSC for the RV, Myo, and LV, as well as in the average DSC (p \(<0.05\)).

In the T2-SPAIR CMR sequence images, CapNet achieves DSC scores of 90.47% for RV, 90.97% for Myo, 95.21% for LV, and an average DSC of 92.22%. Thus, our model shows superior performance compared to SegNet (p=0.0153 for RV, p=0.0235 for Myo, p=0.0359 for LV, and p=\(1.274\times 10^{-3}\) for the average DSC), Res-Unet (p=\(5.474\times 10^{-6}\) for RV, p=\(1.162\times 10^{-3}\) for Myo, and p=\(8.942\times 10^{-7}\) for the average DSC), DS-TransUnet (p=\(6.955\times 10^{-5}\) for RV, p=\(2.291\times 10^{-3}\) for Myo, and p=\(2.335\times 10^{-7}\) for the average DSC), and TransUNet (p=\(1.092\times 10^{-6}\) for RV, p=\(5.318\times 10^{-10}\) for Myo, p=\(2.997\times 10^{-5}\) for LV, and p=\(4.108\times 10^{-14}\) for the average DSC). Additionally, for the LGE CMR sequence images, our model obtains DSC scores of 94.77% for RV, 91.24% for Myo, 95.96% for LV, and an average DSC of 93.99%. This shows that the proposed model excels with significantly higher DSC scores compared to SegNet (p=\(2.535\times 10^{-3}\) for RV, p=\(1.241\times 10^{-5}\) for Myo, p=\(5.186\times 10^{-4}\) for LV, and p=\(1.780\times 10^{-6}\) for the average DSC), Res-Unet (p=\(3.462\times 10^{-9}\) for RV, p=\(1.218\times 10^{-9}\) for Myo, p=\(1.710\times 10^{-9}\) for LV, and p=\(6.573\times 10^{-14}\) for the average DSC), DS-TransUnet (p=\(6.159\times 10^{-3}\) for RV, and p=\(2.076\times 10^{-3}\) for the average DSC), TransUNet (p=\(4.316\times 10^{-4}\) for RV, p=\(1.500\times 10^{-5}\) for Myo, p=\(4.161\times 10^{-4}\) for LV, and p=\(1.142\times 10^{-7}\) for the average DSC), and Swin-Unet (p=\(1.607\times 10^{-3}\) for RV, p=0.0111 for Myo, p=\(4.228\times 10^{-4}\) for LV, and p=\(8.366\times 10^{-5}\) for the average DSC). These results clearly illustrate the exceptional performance of CapNet, making it a highly effective model for CMR image segmentation.

For better quantitative assessment, we provide boxplots of the compared models in terms of Dice scores on the LGE CMR sequence images of the MS-CMR dataset in Fig. 16. As can easily be observed from these figures, our proposed approach provides the highest median and maximum scores for all regions of interest in the dataset.

Ablation Study

Performance of the Hyperparameters \(\alpha\), \(\beta\), and m on the Proposed Loss

To find suitable values for the hyperparameters \(\alpha\) and \(\beta\), we fixed the exponent parameter of the proposed loss at m \(= 2\). As in the Tversky loss, we gradually vary \(\alpha\) and \(\beta\), decreasing \(\alpha\) and increasing \(\beta\) with their sum fixed at 1. The experimental results are shown in Table 5. With \(\alpha = \beta = 0.5\), we obtained an average DSC of 92.37%. After slightly reducing \(\alpha\) to 0.4 and increasing \(\beta\) to 0.6, the average DSC improved slightly to 92.44%. Notably, when we decreased \(\alpha\) to a ratio of 3:7 with \(\beta\), we achieved the best average DSC of 93.05%. However, with \(\alpha = 0.2\), \(\beta = 0.8\) and \(\alpha = 0.1\), \(\beta = 0.9\), the performance decreased compared to the best ratio of 3:7.

Table 5 The experiment comparison between different parameters \(\alpha\) and \(\beta\) on the ACDC data and statistical analysis. DSC scores are in mean (standard deviation)

After finding suitable values for \(\alpha\) and \(\beta\), we fix them at \(\alpha = 0.3\) and \(\beta = 0.7\) and gradually vary the exponent parameter m. As shown in Table 6, the results for m \(= 1\) were not satisfactory, with an average DSC of 91.64% on the ACDC data. As we increased m beyond 1, the results gradually improved: at m \(= \frac{4}{3}\) the average DSC rose to 92.20%, and at m \(= 2\) we achieved a good result of 93.05%. Overall, the best performance is achieved when the exponent parameter m ranges from \(\frac{4}{3}\) to 3, with particularly good results at m \(= 2\).

Table 6 The experiment comparison between different parameters power m on the ACDC data and statistical analysis. DSC scores are in mean (standard deviation)

To check whether there are statistical differences between segmentation scores under various combinations of \(\alpha\), \(\beta\), and m, we computed p-values with statistical tests. In particular, we compute p-values on the DSC scores of the other combinations against our chosen hyperparameters, \(\alpha = 0.3\) and \(\beta = 0.7\) (last row of Table 5), and the chosen \(m=2\) (last row of Table 6). The p-values show no statistical differences among the hyperparameter combinations, implying that the proposed loss is not overly sensitive to its hyperparameters.

Performance of the Proposed Loss

Experiments assessing the performance of the proposed loss are provided in Table 7. In these experiments, the proposed model is trained with several common loss functions: the Tversky, focal Tversky, BCE-Dice, and active contour losses. In the first experiment, we conduct binary segmentation of the endocardium and epicardium on the Sunnybrook dataset. As shown in Table 7(a), the TSPD loss produces superior results compared to the other losses in terms of average DSC and IoU for both the endocardium and epicardium regions; considering the p-values, however, the TSPD scores are not statistically significantly different from those of the other losses. The second experiment evaluates multiclass segmentation on the ACDC dataset, with quantitative results in Table 7(b). As can be observed from the table, the results on the ACDC dataset clearly demonstrate the superior effectiveness of the proposed loss function: the average Dice score of 93.05% is 0.7% higher than the second-ranked result, and the DiceLV metric also indicates the superiority of the proposed loss. The p-values in the last column of Table 7(b) show significant differences compared to the Dice, BCE-Dice, Tversky, and focal Tversky losses.

Table 7 Comparison between the proposed Tversky Shape Power Distance (TSPD) and other losses in the case of (a) binary segmentation and (b) multiclass segmentation

Performance of the PCA-PSA++ Architecture in the Bottleneck

To demonstrate the effectiveness of the PCA-PSA++ module, we replaced it with several other modules and compared their performance; the results are provided in Table 8. Although the number of parameters may not be optimal, the trade-off is that the Dice score significantly surpasses that of the other modules. Among the previous modules, only ASPP demonstrated comparable effectiveness, but with significantly longer computation times. Compared to the previous version, PCA-PSA, the increase in the number of parameters is not significant. The Myo scores are relatively similar, but the RV and LV scores increase significantly, particularly for RV.

Table 8 Comparison between the PCA-PSA++ architecture and other architectures on the ACDC data. Dice (DSC) scores are in mean (standard deviation)

Performance of the Depthwise-Focus Architecture in the Decoder

For another contribution of the paper, the Depthwise-Focus block, we created Tables 9 and 10 to examine whether the depthwise axial convolution with a kernel size of 7 is truly beneficial for the model. We sequentially substituted conventional convolution and axial convolution to compare them with the proposed depthwise axial method used in our model. In Table 9, we kept the kernel size at 7 and added increasing dilations from 1 to 4 to observe their effectiveness. All three methods show the same trend: the Dice score gradually increases from \(d=1\) to the combination of \(d=1\) and \(d=2\), reaching its peak with the combination of \(d=1, d=2\), and \(d=3\); when dilation 4 is added, the results decrease slightly. The depthwise axial method also demonstrates its effectiveness at each dilation level, as its DiceLV is consistently higher than that of the other two methods. We also experimented with kernel sizes of 3, 5, and 7. The results with a kernel size of 5 were lower than those of the other two kernels, while the kernel size of 7 was dominant across all three methods. However, a regular convolution with a kernel size of 7 would significantly increase the number of parameters, whereas with depthwise convolutions there is no significant difference in parameter count between kernel sizes 3, 5, and 7. Indeed, applying depthwise axial convolutions has yielded favorable results while also reducing the number of parameters.

Table 9 Comparison of mean DSC values of RV, MYO, LV, and their average on ACDC dataset with kernel size \(k = 7\) and the increasing dilation d. DSC scores are in mean (standard deviation)
Table 10 Comparison of DSC values of RV, MYO, LV, and their average value on the ACDC dataset with dilation \(d = (1,2,3)\) and the increasing kernel size k. DSC scores are in mean (standard deviation)

Discussion

Contribution of the Study

This study presents the CapNet model, which is based on an attention-clustering mechanism for local information and incorporates smooth feature-flow processing via the proposed Priority Mixer block at the bottleneck. Additionally, a decoding module, namely the Depthwise-Focus block, is employed, leveraging creative convolutional techniques to enhance the accuracy of the predicted labels. Besides, we propose a new loss called the Tversky Shape Power Distance (TSPD) loss function. We conducted experiments on various datasets to demonstrate the effectiveness of our proposed architecture and loss function compared to methods that utilize other loss functions. Specifically, we performed experiments on well-known cardiac segmentation datasets: the Sunnybrook, ACDC, and MS-CMR datasets. Segmentation performance is evaluated by the DSC and IoU metrics, and statistical significance is analyzed by a statistical test with p-values. Our results show that the TSPD loss function consistently outperforms other loss functions in most cases.

Through our experiments, we found that in the context of deep learning-based cardiac image segmentation, CNN-based methods can outperform transformer-based models. The transformer-based approach, though it has shown strong performance in many computer vision areas, still suffers from drawbacks when working with limited training data. In our study, the results on the three datasets, including Sunnybrook, ACDC, and MS-CMR, show that compared to transformer-based methods like TransUnet, DS-TransUnet, and Swin-Unet, the proposed CapNet gives better DSC and IoU scores and shows statistically significant differences.

On the other hand, current CNN-based and transformer-based approaches for cardiac MRI image segmentation still entail a large number of parameters. This motivated us to build a lightweight model. To the best of our knowledge, few prior lightweight models have been specifically designed for this task. The proposed CapNet model has only 1.53 million parameters, 20 times fewer than the well-known U-Net. With fewer parameters, we can reduce the memory footprint and computational complexity of the segmentation tasks, which is especially important for edge devices.
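
For reference, parameter counts such as the 1.53 M reported here can be verified in PyTorch with a one-line reduction; the toy model below is only a stand-in for the actual network:

```python
import torch.nn as nn

# Toy stand-in model; replace with the network whose size is to be measured
model = nn.Sequential(nn.Conv2d(1, 16, 3), nn.ReLU(), nn.Conv2d(16, 1, 3))

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params / 1e6:.3f} M trainable parameters")
```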

Regarding the PCA-PSA++ architecture, the PCA, the PSA, and its variant PSA++ do not possess any learnable parameters; they function as normalization transformations that reorient the outputs. Would incorporating learnable vectors or matrices into PSA/PSA++ or PCA yield significant improvements? In practice, when introducing learnable vectors or matrices, their values are initialized randomly and have no predefined upper or lower bounds. Excessively large gradients may cause the learned parameters to be updated to very large or very small values, leading to instability in learning. Furthermore, when applying normalization functions that map these vectors or matrices into the range 0–1, such as sigmoid or softmax, saturation becomes a problem: since \(\lim_{x \rightarrow -\infty} \sigma(x) = \lim_{x \rightarrow -\infty} \frac{1}{1 + e^{-x}} = 0\), the sigmoid output approaches 0 when x is very negative. Similarly, softmax tends to produce one dominant component while the remaining components converge to zero. Consequently, using these functions incurs a significant loss of information. Another approach could be to set upper and lower bounds for the parameters, but we cannot rule out the possibility that all parameter values will collapse to the lower bound. Alternatively, using sinusoidal functions like sin() or cos() for normalization within the range 0–1 may not guarantee satisfactory results due to their periodic nature. Therefore, we propose PSA/PSA++ and PCA without learnable parameters to demonstrate their passive adaptive capability.
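
This saturation argument can be checked numerically; the snippet below (ours, not the paper's code) shows sigmoid outputs collapsing toward the boundary for large-magnitude inputs and softmax concentrating on a single dominant component:

```python
import torch

# Sigmoid squashes large-magnitude logits to the boundary of (0, 1)
x = torch.tensor([-20.0, -5.0, 0.0, 5.0, 20.0])
print(torch.sigmoid(x))
# tensor([2.0612e-09, 6.6929e-03, 5.0000e-01, 9.9331e-01, 1.0000e+00])

# Softmax gives one dominant component; the rest are driven toward zero
logits = torch.tensor([8.0, 1.0, 0.5, 0.2])
print(torch.softmax(logits, dim=0))
# the first component holds nearly all of the probability mass
```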

Data Sampling Size

Considering the data sampling size, the initial datasets, including Sunnybrook, ACDC, and MS-CMR, consist of a relatively small number of 3D volume samples. For example, the ACDC dataset includes 100 samples, with 70 for training, 10 for validation, and 20 for testing. However, we increased the effective size of the dataset by slicing the 3D volumes along the z-axis, resulting in a larger number of 2D slices; a sketch of this slicing step is given below. Specifically, the 70 training samples of ACDC were converted into 1312 training slices, the 10 validation samples into 202 validation slices, and the 20 test samples into 388 test slices.
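
As a minimal sketch, assuming the volumes are stored as NIfTI files (as in ACDC) with the slice dimension last, the z-axis slicing can be written as follows; the file-handling details are ours, not the authors' exact pipeline:

```python
import nibabel as nib
import numpy as np

def volume_to_slices(nifti_path: str) -> list[np.ndarray]:
    """Split one 3D cardiac volume into independent 2D short-axis slices."""
    volume = nib.load(nifti_path).get_fdata()       # shape (H, W, Z)
    return [volume[:, :, z] for z in range(volume.shape[2])]

# e.g., 70 training volumes -> 1312 training slices after this step
```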

In a similar way, we applied this technique to the cine-MRI dataset of Sunnybrook, with 70% for training, 15% for validation, and 15% for testing. For the epicardium, the 70% training portion was converted into 191 slices, the 15% validation portion into 41 slices, and the remaining 15% into 41 test slices; for the endocardium, the splits yielded 369 training slices, 79 validation slices, and 79 test slices. The same approach was also applied to the MS-CMR dataset, which comprises three sequences: bSSFP, T2, and LGE. After preprocessing, we obtained 333 slices for bSSFP, 148 slices for T2, and 75 slices for LGE. By utilizing this slicing technique, we effectively increased the size of the dataset and the number of independent samples available for training, validation, and testing. This approach not only augments the dataset but also captures the inherent variation and diversity present within the 3D volumes, enhancing the generalizability and robustness of our model.

While we acknowledge that the initial 3D volume sample size was small, the slicing technique allowed us to leverage a significantly larger number of 2D slices, mitigating potential limitations in generalizability and robustness. The increased dataset size and diversity of samples provided a more comprehensive representation of the problem domain, enabling our model to learn and generalize more effectively. Additionally, we employed various data augmentation techniques, such as rotation, flipping, and scaling, to further increase the diversity of the training data and improve the model's ability to generalize to unseen samples; a sketch of such a pipeline follows. We understand the importance of validating our approach on a larger and more diverse dataset, and we will continue to explore opportunities to expand it. Nevertheless, we believe that the slicing technique and data augmentation strategies employed in this study have effectively addressed the potential limitations of the initial sample size.
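
The augmentation step can be sketched as a joint image/mask transform; the parameter ranges below (rotation up to 15 degrees, scale 0.9 to 1.1) are illustrative assumptions, not the exact values used in our experiments:

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment(image, mask):
    """Apply the same random rotation, flip, and scaling to image and mask."""
    angle = random.uniform(-15.0, 15.0)             # assumed rotation range
    image = TF.rotate(image, angle)
    mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)
    if random.random() < 0.5:                       # random horizontal flip
        image, mask = TF.hflip(image), TF.hflip(mask)
    scale = random.uniform(0.9, 1.1)                # assumed scaling range
    h, w = image.shape[-2:]
    size = [int(h * scale), int(w * scale)]
    image = TF.resize(image, size)
    # nearest-neighbor resize keeps the mask labels discrete
    mask = TF.resize(mask, size, interpolation=InterpolationMode.NEAREST)
    return image, mask
```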

Result Discussion and Hyperparameter Settings

To evaluate the performance of the proposed approach, we conducted experiments with CNN- and transformer-based approaches on both binary and multiclass segmentation tasks. We reimplemented all SOTA models on the three datasets for quantitative assessment, plotting, and statistical analysis. Considering the average DSC and IoU scores, our CapNet model obtained better performance than the SOTA for both the binary and multiclass segmentation cases. The statistical significance of the proposed method over transformer-based methods like TransUnet and Swin-Unet is confirmed with p-value \(<0.01\).
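
As an illustration of how such p-values can be obtained, the snippet below runs a paired Wilcoxon signed-rank test on hypothetical per-case DSC scores; the paper's exact test procedure is not specified here, and the scores shown are invented for the example:

```python
from scipy.stats import wilcoxon

# Hypothetical per-case DSC scores for two models on the same test cases
dsc_model_a = [0.95, 0.93, 0.96, 0.94, 0.92, 0.95, 0.94, 0.93]
dsc_model_b = [0.91, 0.90, 0.93, 0.92, 0.89, 0.92, 0.90, 0.91]

stat, p_value = wilcoxon(dsc_model_a, dsc_model_b)
print(f"Wilcoxon statistic = {stat}, p-value = {p_value:.4f}")
```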

In particular, for binary segmentation on the Sunnybrook data, the proposed model achieves a DSC of 94% and an IoU of 88.95% for the endocardium, and a DSC of 95.93% and an IoU of 92.30% for the epicardium. With the same data, the average scores of U-Net are 90.55% (DSC) and 83.51% (IoU) for the endocardium and 94.92% (DSC) and 90.71% (IoU) for the epicardium. The scores of DS-TransUnet are a DSC of 91.33% and an IoU of 84.67% for the endocardium, and 93.86% (DSC) and 88.70% (IoU) for the epicardium. For binary segmentation on the ACDC data, CapNet obtains a DSC of 94.49% (endocardium) and 96.82% (epicardium), while the DSC scores of Attention-Unet are 93.01% (endocardium) and 96.07% (epicardium). The results of the proposed model outperform those of Swin-Unet (DSC of 90.02% for the endocardium and 93.89% for the epicardium). The IoU of CapNet is 90.15% for the endocardium and 93.93% for the epicardium, whereas the IoU values of TransUNet on the ACDC data are 88.02% for the endocardium and 90.92% for the epicardium.

For the multiclass segmentation case, we conducted experiments on the ACDC dataset and the three sequences of the MS-CMR data for the right ventricle (RV), myocardium (Myo), left ventricle (LV), and their average. Similar to the binary case, the proposed CapNet model outperforms the SOTA in terms of average DSC and IoU values and shows statistically significant differences compared to transformer-based approaches such as DS-TransUnet, TransUNet, and Swin-Unet. For the ACDC data, the mean DSC scores of CapNet are 92.34% (RV), 90.95% (Myo), and 95.86% (LV), whereas the scores for the corresponding regions by U-Net++ are 89.58% (RV), 89.52% (Myo), and 94.82% (LV). The scores by Res-Unet are even lower, at 88.08% (RV), 88.89% (Myo), and 93.93% (LV). With the MS-CMR data, notably, the proposed CapNet gives DSC scores of 94.65% (bSSFP), 90.47% (T2), and 94.77% (LGE) for the RV, while the corresponding values are 91.47% (bSSFP), 89.55% (T2), and 92.34% (LGE) for nnUNet, and 91.90% (bSSFP), 82.12% (T2), and 90.00% (LGE) for TransUNet. For the statistical analysis, the p-values show significant differences when comparing the proposed CapNet with transformer-based methods and some CNN-based methods like SegNet and Res-Unet. However, although the proposed model gives better performance in terms of average values, the statistical tests on the IoU and DSC show no significant difference when comparing with some CNN-based models like U-Net, nnUNet, and Attention-Unet. Nevertheless, it is worth mentioning that the proposed model has far fewer parameters, 1.53 M, while U-Net has 31.1 M, nnUNet has 37.6 M, and Attention-Unet has 31.9 M.

Besides building a lightweight model with fewer parameters, developing a suitable loss for training the neural networks is also a promising approach. Designing a loss adds no training parameters to the model but can nevertheless improve performance significantly. In this study, inspired by the traditional optimization-based segmentation framework of active contour and level set models, we build a novel region-based loss, namely the Tversky Shape Power Distance (TSPD). The loss allows adjusting the false positive or false negative rate (by choosing \(\alpha\) and \(\beta\)) to be compatible with the data, and it can be extended to multiclass segmentation as in Eq. 6. Nevertheless, one of the major concerns in building region-based losses [38, 39] for segmentation tasks is choosing the hyperparameters for the false positives and false negatives (\(\alpha\) and \(\beta\)). In practice, we need to conduct experiments exploring various combinations to estimate a suitable range and then choose hyperparameters appropriate for the cardiac data. Based on these experiments, we found that \(\alpha = 0.3\) and \(\beta = 0.7\) yielded the best performance for the binary segmentation task, i.e., the endocardium and epicardium in the Sunnybrook dataset.
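
To make the roles of \(\alpha\) and \(\beta\) concrete, here is a minimal sketch of the Tversky term that the TSPD loss builds on; the shape power distance component of the full loss is not reproduced here:

```python
import torch

def tversky_loss(pred: torch.Tensor, target: torch.Tensor,
                 alpha: float = 0.3, beta: float = 0.7,
                 eps: float = 1e-6) -> torch.Tensor:
    """Binary Tversky loss: alpha weights false positives, beta weights
    false negatives; pred holds foreground probabilities in [0, 1]."""
    tp = (pred * target).sum()
    fp = (pred * (1.0 - target)).sum()
    fn = ((1.0 - pred) * target).sum()
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - tversky
```

Setting \(\beta > \alpha\), as in the 0.3/0.7 configuration above, penalizes false negatives more heavily than false positives and thus biases training against under-segmentation.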

Building upon these findings, we extended our experiments to the multiclass segmentation task using the multiclass form of the loss in Eq. 8, on the ACDC and MS-CMR datasets. Although the ACDC and MS-CMR datasets involve four classes, we found that the same values \(\alpha = 0.3\) and \(\beta = 0.7\) also provided the best overall performance, as shown in Table 5. We believe that these hyperparameter values strike a good balance between the topological similarity term and the pixel-wise similarity term in the TSPD loss function, enabling effective segmentation for both binary and multiclass scenarios. However, we acknowledge that these hyperparameters may not be optimal for all datasets and segmentation tasks. In future work, we plan to explore more advanced techniques for hyperparameter tuning, such as automated hyperparameter optimization methods, to further improve the performance of the TSPD loss function.

Limitations and Considerations

In addition to the notable strengths of this study outlined above, there are still some limitations we would like to discuss. First, as shown in Tables 1, 2, 3, and 4, the p-values comparing the proposed CapNet model with models such as U-Net, nnUnet, and Attention-Unet remain high. This implies that the proposed CapNet model does not convincingly exhibit higher performance than U-Net or Attention U-Net. However, when comparing the number of parameters, the proposed CapNet model is lightweight, with more than an order of magnitude fewer parameters than Attention-Unet, nnUnet, and U-Net. We propose the CapNet model with the aim of balancing performance and parameter efficiency. The lack of statistical significance in the performance improvement could be due to the sample size of the data. Future work could involve increasing the sample size, further optimizing model parameters, or exploring additional features to enhance the model's performance and achieve statistical significance.

Besides, setting the hyperparameters \(\alpha\), \(\beta\), and m for training the initial model is quite challenging. To obtain a good set of hyperparameters, one needs to explore several parameter combinations to estimate a suitable range of values, which can be time-consuming; a hypothetical grid-search sketch is given below. However, the p-values show consistent results across different sets of hyperparameters \(\alpha\), \(\beta\), and m within certain predefined ranges, as demonstrated in Tables 5 and 6, where \(\alpha\) and \(\beta\) both range over [0, 1] and \(1 \le m \le 6\). This indicates that the model remains robust and stable as the hyperparameters \(\alpha\), \(\beta\), and m vary within these ranges.
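
One simple way to carry out this exploration is a coarse grid search; the sketch below is hypothetical throughout (including the `run_validation` stub and the assumed constraint \(\alpha + \beta = 1\), consistent with the 0.3/0.7 setting), not our actual tuning script:

```python
from itertools import product

def run_validation(alpha: float, beta: float, m: int) -> float:
    """Hypothetical stub: train with the given TSPD hyperparameters
    and return the mean validation DSC."""
    raise NotImplementedError  # replace with the real training loop

alphas = [0.1, 0.3, 0.5, 0.7, 0.9]
ms = range(1, 7)                       # 1 <= m <= 6, as in Tables 5 and 6

best_score, best_cfg = -1.0, None
for alpha, m in product(alphas, ms):
    beta = 1.0 - alpha                 # assumed constraint alpha + beta = 1
    score = run_validation(alpha, beta, m)
    if score > best_score:
        best_score, best_cfg = score, (alpha, beta, m)
```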

Conclusion

We have presented a new network model and a new loss function for segmentation of cardiovascular magnetic resonance images. The network is trained end to end with a small number of parameters. The proposed Priority Mixer block and Depthwise-Focus block with attention mechanisms are applied to better learn information from anatomical structures whose size and shape vary across cardiac phases. In addition, we propose a new loss called the Tversky Shape Power Distance, based on the dissimilarity in shape distances between the mask and the predicted label. Extensive experiments and ablation studies have been performed to demonstrate the advantages of both the proposed architecture and the proposed loss function.