
1 Introduction

Cardiac Magnetic Resonance (CMR) imaging is the gold standard for non-invasive evaluation of cardiac anatomical structures and function [1]. Anatomical segmentation isolates the structures of interest to clinicians. By discarding irrelevant information, it reduces image size, which in turn lowers the post-processing time and computing power required for downstream analysis. It is also a crucial prerequisite for computing several image-based biomarkers [2,3,4] with diagnostic value. Recently, with the development of artificial intelligence, fully automatic segmentation algorithms based on deep learning have begun to surpass manual segmentation, offering faster speed, less subjective bias, and comparable or even higher accuracy. In clinical practice, however, model performance depends heavily on image quality. In CMR acquisition, respiratory motion is one of the major causes of degraded image quality, as patients with acute symptoms may find it difficult to follow instructions and hold their breath for a long time during the scan. On images contaminated by respiratory motion, model performance drops significantly, with obvious failure cases.

Currently, most automatic cardiac segmentation models are based on deep learning, which learns a function mapping input images to segmentation masks. This approach relies heavily on the quantity and quality of the training data, and rests on the assumption that the held-out data follows a distribution similar to that of the training data. In practical clinical scenarios, the amount of available training data is limited, and most images are of high quality in order to guarantee diagnostic value. Trained with such data as supervision, the model performs well only on clean images and can easily fail when encountering low-quality ones.

Pre-training and data augmentation are two important data-driven methods with a proven effect on model robustness; they focus on enlarging the training data and better exploiting the existing data, respectively. Pre-training exposes the model to a large dataset beyond the training set, broadening the model's horizon, and is recognized to yield better results than training from scratch [21, 22]. Although some researchers argue that pre-training offers little benefit for certain tasks or lightweight architectures [23, 24], it is still undeniable that pre-training can enhance model robustness and improve performance on held-out individuals [25, 26]. Data augmentation is another standard practice for building robust segmentation models. It exposes the network to higher variability through perturbations of the training data, including naive approaches such as cropping, rotation, and flipping. As respiratory motion causes spatial transformation of the anatomical structures, deformation-based augmentation [6] is also rewarding for training. A recently proposed adversarial data augmentation method [5] can generate plausible and realistic signal corruptions that are difficult for models to analyze, thereby increasing the model's adversarial robustness.

In this work, we explore these two data-driven approaches in the context of the CMRxMotion challenge [32]. The main goal of this challenge is to build a model for segmentation of the left ventricle (LV), left ventricular myocardium (MYO), and right ventricle (RV) from limited training data. The model should be robust under different levels of respiratory motion, as the test data comprises images of diverse quality levels. We pre-train our model on large publicly available datasets with the same tasks, and increase data variability through both random and adversarial augmentation. We find that, with intensive pre-training and strong data augmentation, and even without novel DCNN architectures, the model remains highly robust to image quality, regardless of the cause of degradation, such as respiratory motion.

2 Methods

We first give a brief introduction to the dataset of the CMRxMotion challenge and then explain our proposed approaches in detail.

2.1 Dataset

The CMRxMotion dataset [32] consists of short-axis (SA) cine MRI acquisitions of 45 healthy volunteers. Each volunteer is instructed to act in four manners during the scanning process, namely a) adhere to the breath-hold instructions, b) halve the breath-hold period, c) breathe freely, and d) breathe intensively. Pixel sizes vary from \(\sim \)0.66 to \(\sim \)0.76 mm, the in-plane image resolution ranges from 400 to 512 pixels, the number of slices is between 9 and 13, and the slice thickness ranges from \(\sim \)9.6 to 10 mm. For images of diagnostic quality, the LV, MYO, and RV at end-systole (ES) and end-diastole (ED) are manually segmented by radiologists. Exams of 20 volunteers with both images and ground-truth segmentations, and of 5 volunteers with images only, are released for training and validation, respectively. The remaining 20 volunteers are withheld for testing.

2.2 Network Architecture

We evaluate three U-Net variants: nnU-Net [14], Swin-UNETR [13], and Swin-UNet [8].

nnU-Net [14] is a purely CNN-based method, which is good at capturing local patterns in the image. One modification we make is to replace the convolution operation at the bottleneck layer with deformable convolution [10], which has been shown to increase performance since it allows for a flexible receptive field [20]. The sampling offsets learned by deformable convolutions are expected to counteract some of the shifts caused by respiratory motion.
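Below is a minimal sketch of this substitution using torchvision's DeformConv2d; the channel widths, normalization, and zero-initialization of the offset branch are illustrative assumptions rather than details of our exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBottleneck(nn.Module):
    """Replace a 3x3 bottleneck convolution with a deformable one.

    The offset field is predicted by an ordinary convolution and fed to
    DeformConv2d, so the effective sampling grid can shift with the
    anatomy (e.g., displacement caused by respiratory motion).
    """
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # 2 offsets (dy, dx) per kernel position
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)  # start as a plain convolution
        nn.init.zeros_(self.offset.bias)
        self.conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.norm = nn.InstanceNorm2d(out_ch, affine=True)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.norm(self.conv(x, self.offset(x))))
```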

In contrast, Swin-UNet [8] is a purely transformer-based network, which is better at capturing global features through self-attention and shifted windows [11]. However, the segmentation results of the raw Swin-UNet usually contain zigzag margins, since Swin-UNet operates on patches rather than pixels as its smallest unit. To overcome this limitation, we add two convolutional layers with layer normalization and leaky ReLU at the end of the network, which helps produce smooth segmentation results.
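A sketch of such a refinement head is shown below; the hidden width and the use of single-group GroupNorm as a layer-norm substitute for convolutional feature maps are illustrative choices, since the text above only fixes the layer count, normalization type, and activation.

```python
import torch
import torch.nn as nn

class SmoothingHead(nn.Module):
    """Two convolutional layers appended to Swin-UNet's output to smooth
    patch-level zigzag artifacts before the final prediction."""
    def __init__(self, in_ch: int, n_classes: int, hidden: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1)
        # GroupNorm with one group normalizes over all channels, a common
        # layer-norm substitute for 2D feature maps
        self.norm = nn.GroupNorm(1, hidden)
        self.act = nn.LeakyReLU(inplace=True)
        self.conv2 = nn.Conv2d(hidden, n_classes, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv2(self.act(self.norm(self.conv1(x))))
```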

Swin-UNETR [13] combines a transformer-based encoder with a CNN-based decoder, which is expected to exploit the global information in the images and generate a refined segmentation map.

The input volumes contain only around 10 slices, which is not enough to be divided into multiple patches and windows along the slice dimension. Therefore, for nnU-Net we try both 2D and 3D variants, while for Swin-UNETR and Swin-UNet only the 2D versions are used.

2.3 Pre-training

Our training data comprises only 20 healthy volunteers, which is too few for the model to exhibit strong robustness even to the inherent anatomical variation among individuals, let alone to unstable image quality or perturbations. Pre-training has proven effective in improving model performance as well as enhancing robustness. To increase robustness to unseen data with different appearances and qualities, we collect five public cine MRI datasets with the same segmentation tasks, listed in Table 1, for pre-training. We fuse these datasets and use labeled images from both their training and testing phases. As transformer-based methods benefit more from pre-training, Swin-UNet is additionally pre-trained on ImageNet [19]. The encoder of Swin-UNETR is pre-trained on additional unlabeled public CMR datasets [28,29,30,31], following a multi-task self-supervised learning scheme [27]. Although no deliberate respiratory motion occurred during the acquisition of these images, they exhibit significant variation in many other aspects, including but not limited to scanner type, acquisition center, protocol, and the health condition of the subjects. We believe these variations teach the model to ignore pixel-wise noise and instead capture the essential, high-level features useful for cardiac segmentation.
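As an illustration, the weight transfer from pre-training to fine-tuning can be performed by name-and-shape matching; the helper below is a generic sketch, not our exact loading code.

```python
import torch
import torch.nn as nn

def load_pretrained(model: nn.Module, ckpt_path: str) -> nn.Module:
    """Transfer pre-trained weights by name/shape matching (illustrative).

    Copying only tensors whose names and shapes match keeps the transfer
    safe even if, e.g., the segmentation head differs between the
    pre-training and fine-tuning stages.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    matched = {k: v for k, v in state.items()
               if k in own and v.shape == own[k].shape}
    own.update(matched)
    model.load_state_dict(own)
    print(f"transferred {len(matched)}/{len(own)} tensors")
    return model
```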

Table 1. Public CMR dataset used for pre-training

2.4 Data Augmentation

We use both random data augmentation and adversarial data augmentation.

For random data augmentation, we follow the same scheme as the default mode of nnU-Net [14], which comprises rotation, scaling, Gaussian noise, Gaussian blur, brightness, contrast, simulated low resolution, gamma correction, and mirroring.
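The sketch below illustrates the intensity part of such a pipeline; the probabilities and parameter ranges are assumptions rather than nnU-Net's exact defaults, and the spatial transforms (rotation, scaling, mirroring) would additionally be applied to the label map.

```python
import numpy as np
from scipy import ndimage

def random_intensity_augment(img: np.ndarray,
                             rng: np.random.Generator) -> np.ndarray:
    """Illustrative subset of the intensity augmentations listed above."""
    if rng.random() < 0.15:  # Gaussian noise
        img = img + rng.normal(0.0, rng.uniform(0.0, 0.1), img.shape)
    if rng.random() < 0.2:   # Gaussian blur
        img = ndimage.gaussian_filter(img, sigma=rng.uniform(0.5, 1.5))
    if rng.random() < 0.15:  # brightness
        img = img * rng.uniform(0.7, 1.3)
    if rng.random() < 0.15:  # contrast, scaled around the mean
        m = img.mean()
        img = (img - m) * rng.uniform(0.65, 1.5) + m
    if rng.random() < 0.1:   # gamma correction on rescaled intensities
        lo, hi = img.min(), img.max()
        img = ((img - lo) / (hi - lo + 1e-8)) ** rng.uniform(0.7, 1.5)
        img = img * (hi - lo) + lo
    return img
```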

An inherent property of our data is contamination by respiratory motion, which can result in diffeomorphic deformation or spatial transformation of the raw image. Adversarial data augmentation has proven more effective than random data augmentation at improving model robustness to a specific type of perturbation [7]. In adversarial data augmentation, given a model, the optimal perturbation of a certain type is learned, namely the one that impairs model performance to the utmost extent; the model is then trained to resist this perturbation, which in our case is the deformation caused by respiratory motion. In this work, we apply AdvChain [5] to improve our model's robustness to diffeomorphic deformation and spatial transformation. More specifically, we first train the network with random data augmentation, then fine-tune it with AdvChain. In each iteration of the fine-tuning phase, we turn off Gaussian noise, rotation, and scaling in the random data augmentation while keeping the rest. A perturbation consisting of a chain of Gaussian noise, spatial transformation, and diffeomorphic deformation is randomly generated, with trainable parameters. We freeze the network parameters and optimize the perturbation parameters to increase the consistency loss (MSE and contour loss [9] in our case) between the segmentation maps predicted before and after the perturbation. Finally, we fix this perturbation and update the network parameters by back-propagating the supervised loss together with the consistency loss. These steps are repeated for multiple iterations until convergence.
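The following schematic sketches one such fine-tuning iteration against a generic differentiable perturbation module; all names, optimizer settings, and loop structure are illustrative, and the actual implementation is the published AdvChain code [5].

```python
import torch

def adversarial_step(model, perturb, x, y, sup_loss, cons_loss,
                     inner_steps: int = 1, lam: float = 1.0):
    """One AdvChain-style iteration (schematic).

    `perturb` is assumed to be an nn.Module whose trainable parameters
    define a chain of Gaussian noise, affine transformation, and
    diffeomorphic deformation, applied to x by its forward pass.
    """
    # Step 1: freeze the network, optimize the perturbation to *increase*
    # the inconsistency between clean and perturbed predictions.
    for p in model.parameters():
        p.requires_grad_(False)
    adv_opt = torch.optim.Adam(perturb.parameters(), lr=0.1)
    for _ in range(inner_steps):
        with torch.no_grad():
            ref = model(x)
        adv_opt.zero_grad()
        loss = -cons_loss(model(perturb(x)), ref)  # gradient ascent
        loss.backward()
        adv_opt.step()
    # Step 2: fix the perturbation, unfreeze the network, and minimize the
    # supervised loss plus the consistency loss under the perturbation.
    for p in model.parameters():
        p.requires_grad_(True)
    pred = model(x)
    pred_adv = model(perturb(x).detach())
    return sup_loss(pred, y) + lam * cons_loss(pred_adv, pred)
```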

2.5 Training Protocol

Pre-processing follows a framework similar to the default mode of nnU-Net [14]: resampling all images to a pixel size of \(\sim \)0.66 mm with third-order spline interpolation, cropping or padding the resulting images to a resolution of 512 \(\times \) 512 pixels, normalization, and intensity clipping (0.5 and 99.5 percentiles). For transformer-based methods, we instead crop the images to 224 \(\times \) 224 pixels, position the heart at the image center using ground-truth segmentation labels (training set) or pseudo-labels predicted by nnU-Net (validation and test sets) as reference, and apply min-max normalization to each image.

For inference, we apply test-time augmentation by mirroring along all axes. In post-processing, we keep only the largest connected component of each structure in the predicted segmentation mask, performing this operation twice, slice-wise and volume-wise. We then remove the RV from slices predicted to contain neither LV nor MYO.

We train our models on an NVIDIA A100 GPU with 80 GB of memory. All networks are implemented in PyTorch. The optimization of nnU-Net follows its default settings. Swin-UNETR is trained with the AdamW optimizer with an initial learning rate of 0.0004 and weight decay of 0.00005. Swin-UNet is trained with the SGD optimizer with an initial learning rate of 0.05, weight decay of 0.0001, and momentum of 0.9. A weighted sum of cross-entropy loss and Dice loss is used for optimization.
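The connected-component post-processing can be sketched as follows; the label encoding (1 = LV, 2 = MYO, 3 = RV) is an assumption.

```python
import numpy as np
from scipy import ndimage

LV, MYO, RV = 1, 2, 3  # assumed label encoding

def largest_component(mask: np.ndarray, label: int) -> np.ndarray:
    """Keep only the largest connected component of one structure."""
    binary = mask == label
    comps, n = ndimage.label(binary)
    if n <= 1:
        return mask
    sizes = ndimage.sum(binary, comps, index=range(1, n + 1))
    keep = 1 + int(np.argmax(sizes))
    out = mask.copy()
    out[binary & (comps != keep)] = 0
    return out

def postprocess(vol: np.ndarray) -> np.ndarray:
    """Volume-wise then slice-wise cleanup, plus the RV-removal rule."""
    for lab in (LV, MYO, RV):
        vol = largest_component(vol, lab)
    for z in range(vol.shape[0]):
        sl = vol[z]
        for lab in (LV, MYO, RV):
            sl = largest_component(sl, lab)
        if not np.any((sl == LV) | (sl == MYO)):
            sl[sl == RV] = 0  # drop RV where neither LV nor MYO appears
        vol[z] = sl
    return vol
```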

3 Experiments

We conduct a series of comparisons and ablation studies to select the best network architecture (Sect. 3.1) and to evaluate the effectiveness of the different data-driven approaches (Sect. 3.2). We randomly split the 20 volunteers in the training set into five non-overlapping folds. For quick comparison of the proposed methods, we train networks on four folds and use the remaining fold to determine the convergence of the training process. We report the Dice coefficients and 95% Hausdorff distances for the three anatomical structures on the online validation set.

3.1 Architectural Variants

We first compare the performance of 2D nnU-Net, 3D nnU-Net, Swin-UNETR, and Swin-UNet. For a fair comparison, all models are pre-trained on the public datasets. The results on the online validation set are shown in Table 2.

From the results, we find that 2D nnU-Net performs better than 3D nnU-Net, which agrees with the existing literature and may be due to the large between-slice distance of the SA sequence [12]. We also find that transformer-based methods achieve comparable (slightly lower) results to CNN-based methods on LV and RV segmentation, but are inferior on MYO segmentation. By inspecting performance on individual cases and slices, we further observe that the segmentation maps generated by nnU-Net are smooth and regular, while the transformer results have twisted margins, especially when the edges in the raw image are vague. We infer that the patch-division and window-partition process of the Swin transformer is the reason for the rough segmentation maps. As the raw input is discretized into multiple windows, the transformer performs well on large, convex structures such as the LV and RV, but struggles with delicate structures such as the MYO.

Moreover, when comparing results under different respiratory motion intensities, we find that nnU-Net is more robust. Respiratory motion can blur the margin between the RV and its adjacent tissue, to which transformer-based methods are more vulnerable: the transformer is prone to include nearby structures with similar pixel intensities in the predicted RV segmentation map. In addition, when the MYO margin is ambiguous, nnU-Net seems to follow a population-based prior, giving a smooth and rounded segmentation, while the transformer results are more distorted and angular, remaining more faithful to the raw pixel intensities (Fig. 1).

Table 2. Quantitative comparison of proposed architectures on the online validation set of 5 volunteers, in terms of Dice and Hausdorff distance
Fig. 1. Qualitative segmentation results of different network architectures. From top to bottom: images from a single volunteer with breath-holding, half breath-holding, regular breathing, and intensive breathing. Cases are selected from the online validation set.

3.2 Data-Driven Methods with Pre-training and Augmentation

To evaluate the effectiveness of pre-training and adversarial data augmentation, we choose the vanilla 2D nnU-Net as our baseline. We then compare it to a nnU-Net pre-trained on the public datasets (Sect. 2.3) and a nnU-Net with AdvChain (Sect. 2.4). The results are reported in Table 3.

From the results, we conclude that both pre-training and adversarial augmentation improve performance over the baseline. Pre-training significantly improves the segmentation accuracy of the LV and MYO, while AdvChain yields a considerable gain in RV segmentation. Combining both methods further enhances segmentation accuracy.

3.3 Ensemble

Based on the above comparisons and analysis, we combine the network architectures and data-driven methods shown to improve over their baselines into an ensemble as our final submission to the CMRxMotion challenge. To this end, we train the four networks described in Sect. 3.1 with pre-training and adversarial augmentation in a 5-fold cross-validation setting on the training dataset, and average the outputs of all networks to obtain the ensemble prediction. On the online validation set, our model achieves Dice scores of 0.9220, 0.8352, and 0.9069 for segmentation of the LV, MYO, and RV, respectively, with 95% Hausdorff distances of 8.07, 3.70, and 4.69. In the test phase, our model achieves Dice scores of 0.9372, 0.8738, and 0.9239, with 95% Hausdorff distances of 3.13, 2.57, and 3.59, ranking first among all participants.
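A minimal sketch of the inference-time ensembling, with the mirroring-based test-time augmentation from Sect. 2.5 folded in (model construction and fold handling are omitted; the flip set shown is illustrative):

```python
import torch

def predict_with_tta(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Average softmax maps over mirrored views (test-time augmentation)."""
    views = [x, torch.flip(x, dims=[-1]), torch.flip(x, dims=[-2]),
             torch.flip(x, dims=[-1, -2])]
    undo = [lambda p: p,
            lambda p: torch.flip(p, dims=[-1]),
            lambda p: torch.flip(p, dims=[-2]),
            lambda p: torch.flip(p, dims=[-1, -2])]
    probs = [u(torch.softmax(model(v), dim=1)) for v, u in zip(views, undo)]
    return torch.stack(probs).mean(dim=0)

@torch.no_grad()
def ensemble_predict(models, x):
    """Average the TTA probabilities of all member networks, then argmax."""
    probs = torch.stack([predict_with_tta(m, x) for m in models]).mean(dim=0)
    return probs.argmax(dim=1)
```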

4 Discussion and Conclusion

In this work, we propose a data-centric approach to cardiac segmentation that achieves high performance even when image quality is degraded by intensive respiratory motion. We compare multiple networks and methods and find that pre-training and adversarial augmentation are two effective data-driven approaches that significantly improve model performance. Deep learning is essentially a data-driven methodology whose performance relies heavily on data quantity and quality. Collecting sufficient data and making the utmost use of it is second to none in improving model performance and robustness. We believe this philosophy can guide the real-world application of deep learning: instead of designing novel DCNN architectures or fancy training schemes, rethinking how to utilize the data can be far more important.

Table 3. Segmentation performance of baseline, pre-training, and AdvChain, in terms of Dice and Hausdorff distance.

Intensive respiratory motion impairs the images by making margins indistinct, so it becomes hard for the model to decide to which structure each pixel belongs. During the experiments, we find that even when the boundary between LV and MYO or between MYO and RV is ambiguous, the model can still predict it well. However, it is hard for the model to delineate the boundaries between the MYO, the RV, and their surrounding tissues, especially for the RV, which shows great irregularity and variability in shape and pixel intensity. Currently, we mainly use common data augmentations such as rotation or flipping. In the future, we may design dedicated perturbations that better reflect this fuzzy-boundary property.

Another future direction is the use of inter-slice information. Although the performance of 3D nnU-Net is inferior to that of 2D nnU-Net due to the discontinuity between slices, we find that an ensemble of both models performs better than either alone. We therefore believe that inter-slice information is beneficial for the cardiac segmentation task. Furthermore, the influence of respiratory motion may differ from slice to slice; a well-designed architecture may exploit the information in clean slices to help segment corrupted ones.