1 Introduction

Cardiac segmentation from late gadolinium-enhanced (LGE) cardiac magnetic resonance (CMR), which highlights infarcted myocardial tissue, is of great clinical importance, enabling quantitative measurements useful for treatment planning and patient management. To this end, segmentation of the myocardium is an important first step for myocardial infarction analysis.

Since manual segmentation is tedious and prone to inter-observer variability, it is of great interest to develop an accurate automated segmentation method. However, this is a challenging task because (1) the infarcted myocardium presents an enhanced and heterogeneous intensity distribution that differs from the normal myocardium and (2) the border between the infarcted myocardium and the blood pool appears blurry and ambiguous [1]. While the borders of the myocardium can be difficult to delineate on LGE images, they are clear and easy to identify on balanced steady-state free precession (bSSFP) CMR images, which have a high signal-to-noise ratio and whose contrast is less sensitive to pathology (see red arrows in Fig. 1(a)). Conventional methods [2, 3] use the segmentation result from the bSSFP CMR of the same patient as prior knowledge to assist the segmentation of LGE CMR images. These methods generally require accurate registration between the bSSFP and LGE images, which can be challenging as the imaging field-of-view (FOV), image contrast and resolution can vary significantly between the two acquisitions [1, 4]. Figure 1(b) visualizes the discrepancy between the intensity distributions of the two imaging modalities in the cardiac structures, specifically the left ventricle (LV), myocardium (MYO) and right ventricle (RV).

Fig. 1. Differences in image appearance (a) and intensity distributions (b) in the cardiac region (the union of LV, MYO and RV) between LGE images and bSSFP images (Color figure online)

Most recently, a deep neural network-based method has been proposed to segment the three cardiac structures directly from LGE images [5], reporting superior performance. However, this supervised segmentation method requires a large amount of labelled LGE data. Because of the heterogeneous intensity distribution of the myocardium in LGE images and the scarcity of experienced image analysts, it is more difficult to perform accurate manual segmentation and collect a large training set on LGE images than on bSSFP images.

In this paper, we present a fully automatic framework that addresses the above-mentioned issues by training a segmentation model without using any manual annotations on LGE images. This is achieved by transferring the anatomical knowledge and features learned from annotated bSSFP images, which are easier to obtain. Our framework mainly consists of two neural networks:

  • A multi-modal image translation network: this network translates annotated bSSFP images into LGE-like images through style transfer. Of note, the network is trained in an unsupervised fashion, with unpaired bSSFP and LGE training images. In addition, unlike common one-to-one translation networks, it allows the generation of multiple synthetic LGE images conditioned on a single bSSFP image;

  • A cascaded segmentation network for LGE images consisting of two U-net [6] models (Cascaded U-net): this network is first trained using the labelled bSSFP images and then fine-tuned using the synthetic LGE data generated by the image translation network.

The main contributions of our work are the following: (1) we employ a translation network that can generate realistic and diverse synthetic LGE images given a single bSSFP image. This network enables generative model-based data augmentation for unsupervised domain adaptation, which not only closes the domain gap between the two modalities, but also improves the generalization properties of the following segmentation network by increasing data variety; (2) we demonstrate that the proposed two-stage cascaded network, which takes both anatomical shape information and image appearance information into account, produces accurate segmentation on LGE images, greatly outperforming baseline methods; (3) the proposed framework can be easily extended to other unsupervised cross-modality domain adaptation applications where labels of one modality are not available.

2 Methodology

The proposed method aims to learn an LGE image segmentation model using only labelled bSSFP images \(\{ ({\varvec{{x}}}_b, {\varvec{{y}}}_b) \}\) and unlabelled LGE images \(\{ {\varvec{{x}}}_l \}\). Specifically, the proposed method is a two-stage framework. In the first stage, an unsupervised image translation network is trained to translate each bSSFP image \({\varvec{{x}}}_b\) into multiple instances of LGE-like images, denoted as \(\{{\varvec{{x}}}_{bl}\}\). In the second stage, these LGE-stylized bSSFP images, together with their original labels \(\{({\varvec{{x}}}_{bl}, {\varvec{{y}}}_b)\}\), are used to adapt a segmentation network pre-trained on labelled bSSFP images to segment LGE images.
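In outline, the first stage can be summarized as follows (a minimal sketch in Python; `translator.to_lge` is a hypothetical wrapper around the trained translation network, not the authors' implementation):

```python
def build_synthetic_lge(translator, labelled_bssfp, num_styles=5):
    """Stage 1: translate each labelled bSSFP pair (x_b, y_b) into
    `num_styles` LGE-styled copies that inherit the bSSFP label map.
    `translator.to_lge` is a hypothetical wrapper around the translation
    network of Sect. 2.1."""
    return [(translator.to_lge(x_b), y_b)
            for (x_b, y_b) in labelled_bssfp
            for _ in range(num_styles)]
```

In the second stage, the segmentation network of Sect. 2.2 is pre-trained on the labelled bSSFP pairs and then fine-tuned on this synthetic set.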

2.1 Image Translation

We employ the state-of-the-art multi-modal unsupervised image-to-image translation network (MUNIT) [7] as our multi-modal image translator. Let \(\{{\varvec{{x}}}_{l}\}\) and \(\{{\varvec{{x}}}_{b}\}\) denote unpaired images from the two imaging modalities (domains), LGE and bSSFP. Given an image drawn from one domain as input, the network is able to change its appearance (i.e. image style) to that of the other domain while preserving the underlying anatomical structure [8]. This is achieved by learning disentangled image representations.

Fig. 2. Overview of the multi-modal image translation network. The network employs the structure of MUNIT [7], which consists of two encoder-decoder pairs, one for each of the two domains (bSSFP and LGE).

As shown in Fig. 2, each image \({\varvec{{x}}}\) is disentangled into (a) a domain-invariant content code \(\varvec{c}=E^c{({\varvec{{x}}})}\) and (b) a domain-specific style code \(\varvec{s}=E^s{({\varvec{{x}}})}\), using the content encoder \(E^c\) and the style encoder \(E^s\) of its domain. The content code captures the anatomical structure, while the style code carries the information for rendering that structure, which is determined by the imaging modality. Image-to-image translation from one domain to the other is achieved by swapping the latent codes of the two domains. For example, a bSSFP image \(\varvec{x_b}\) is stylized as LGE by feeding its content code \(\varvec{c_b}\) together with a style code \(\varvec{s_l}\) into the LGE decoder \(D_l\): \({\varvec{{x}}}_{bl} =D_{l}({\varvec{{c}}}_{b},{\varvec{{s}}}_{l})\).

Of note, during training, each style encoder is trained to embed images into a latent space that matches the standard Gaussian distribution \({\mathcal {{N}}}(0, I)\) by minimizing the Kullback-Leibler (KL) divergence between the two. This makes it possible to generate an arbitrary number of synthetic LGE images from a single bSSFP image at inference time, by repeatedly sampling the style code from the prior distribution \({\mathcal {{N}}}(0, I)\). Although this prior distribution is unimodal, the distribution of translated images in the output space is multi-modal thanks to the nonlinearity of the decoder [7]. We apply this translation network to the annotated bSSFP images, resulting in a synthetic labelled LGE dataset which is then used to fine-tune a segmentation network. For more details on training the translation network, readers are referred to the original work by Huang et al. [7].
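The following PyTorch-style sketch illustrates this sampling step; the encoder and decoder are passed in as modules, and the style-code dimensionality (8, the MUNIT default) is an assumption rather than a detail stated in this paper:

```python
import torch

def translate_bssfp_to_lge(x_b, content_enc_b, decoder_l,
                           num_samples=5, style_dim=8):
    """Generate several LGE-styled versions of one bSSFP batch x_b
    (shape (B, 1, H, W)) by resampling the style code."""
    c_b = content_enc_b(x_b)        # domain-invariant content code c_b
    outputs = []
    for _ in range(num_samples):
        # Each style code drawn from the prior N(0, I) yields a different
        # LGE appearance for the same underlying anatomy.
        s_l = torch.randn(x_b.size(0), style_dim, 1, 1, device=x_b.device)
        outputs.append(decoder_l(c_b, s_l))   # x_bl = D_l(c_b, s_l)
    return outputs
```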

2.2 Image Segmentation

Fig. 3. Overview of the two-stage cascaded segmentation network. The architecture of each U-net is the same as that of the vanilla U-net, except for two main differences: (1) batch normalization is applied after each convolutional layer; (2) a dropout layer (dropout rate = 0.1) is applied after each concatenation operation in the network's expanding path to encourage model generalizability. Of note, this diagram simplifies the training procedure by omitting the pre-training on labelled bSSFP images.
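The two modifications named in the caption could be realized as below (a sketch assuming a standard PyTorch U-net; module names and channel arithmetic are ours, not the authors'):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions, each followed by batch normalization and
    ReLU; modification (1) from the caption of Fig. 3."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class UpBlock(nn.Module):
    """Expanding-path step: upsample, concatenate the skip connection,
    then apply dropout (rate 0.1) after the concatenation; modification
    (2). Assumes the skip tensor carries out_ch channels, so that the
    concatenation restores in_ch channels (in_ch == 2 * out_ch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.drop = nn.Dropout2d(p=0.1)
        self.conv = ConvBlock(in_ch, out_ch)

    def forward(self, x, skip):
        x = torch.cat([self.up(x), skip], dim=1)
        return self.conv(self.drop(x))
```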

Let \(\varvec{x}_{l}\) be an observed LGE image. The aim of the segmentation task is to estimate the label map \(\varvec{y}_{l}\) given \(\varvec{x}_{l}\), by modelling the posterior \(p(\varvec{y}_{l}|\varvec{x}_{l})\). Inspired by curriculum learning [9] and transfer learning, we first train a segmentation network using annotated bSSFP images (source domain; easy examples) and then fine-tune it to segment LGE images (target domain; hard examples). Since labelled LGE images \(\{({\varvec{{x}}}_{l}, {\varvec{{y}}}_{l})\}\) are not available for fine-tuning, we use a synthetic dataset \(\mathcal {X}_{bl}: \{ ({\varvec{{x}}}_{bl},{\varvec{{y}}}_{b})\}_{1..N}\) generated by the aforementioned multi-modal image translator. Ideally, the posterior \(p(\varvec{y}_{b}|\varvec{x}_{bl})\) modelled by the network matches \(p(\varvec{y}_{l}|\varvec{x}_{l})\) when the image and label spaces are shared. For simplicity, we use \(\varvec{x}\) and \(\varvec{y}\) to denote an image and its corresponding label map from the synthetic dataset in the following paragraphs.

The segmentation network is a two-stage cascade consisting of two U-nets [6] (see Fig. 3). Specifically, given an image \(\varvec{x}\) as input, the first U-net (U-net 1) predicts four-class pixel-wise probabilistic maps \(\varvec{p_1}=f_{\text {U-net}}^{1}(\varvec{x}; \theta )\) for the three cardiac structures (i.e. LV, MYO, RV) and the background class (BG). Inspired by the auto-context architecture [10], we combine these learned probabilistic maps \(\varvec{p_1}\) with the raw image \(\varvec{x}\) to form a 5-channel input for the second U-net (U-net 2), which performs fine-grained segmentation: \(\varvec{{p_2}} ={{f_\text {U-net}^2}}(\varvec{x},\varvec{p_1};\phi )\). By combining the appearance information from the image \({\varvec{{x}}}\) with the shape prior information from the initial segmentation \(\varvec{p_1}\), the cascaded network has the potential to produce more precise and robust segmentations, even in the presence of unclear boundaries between the cardiac structures.
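A minimal sketch of this cascaded forward pass (here `unet1` and `unet2` stand for the two networks described above):

```python
import torch
import torch.nn.functional as F

def cascaded_forward(unet1, unet2, x):
    """x: batch of single-channel CMR slices, shape (B, 1, H, W)."""
    p1 = F.softmax(unet1(x), dim=1)   # (B, 4, H, W): BG, LV, MYO, RV
    # Auto-context: stack the raw image with the first-stage probability
    # maps to form the 5-channel input of the second, fine-grained U-net.
    p2 = F.softmax(unet2(torch.cat([x, p1], dim=1)), dim=1)
    return p1, p2
```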

To train the network, we use a composite segmentation loss \(\mathcal {L}_{seg}\) consisting of two terms: \( \mathcal {L}_{seg} = \mathcal {L}_{wce}+\lambda \mathcal {L}_{edge}. \) The first term \(\mathcal {L}_{wce}\) is a weighted cross entropy loss: \( \mathcal {L}_{wce}=-\sum _{m} \omega ^{m} \varvec{y}^{m} \log \left( \varvec{p}^{m}\right) \), where \(\omega ^{m}\) denotes the weight for class m and \(\varvec{p}^{m}\) is the corresponding predicted probability map. We set the weight for the myocardium \(\omega ^{MYO}\) higher than the weights for the other three classes to address the class imbalance problem, since a lower percentage of pixels corresponds to the myocardium class in CMR images. The second term \(\mathcal {L}_{edge}\) is an edge-based loss which penalizes disagreement on the contours of the cardiac structures. Specifically, we apply two 2D \(3 \times 3\) Sobel filters [11] \(S_k\) (k = 1, 2) to the soft prediction maps \(\varvec{p}\) as well as the one-hot heatmaps \(\varvec{y}\) of the ground truth to extract edge information along the horizontal and vertical directions. The edge loss is then computed as the \(l_{2}\) distance between the predicted and ground truth edge maps: \( \mathcal {L}_{edge}=\sum _{m, m \ne BG} \sum _{k=1,2} \left\| f_{S_{k}}({\varvec{p}^m})-f_{S_{k}}(\varvec{y}^{m})\right\| _{2} \), where \(f_{S_{k}}(\varvec{p}^{m})\) is the edge map obtained by applying the Sobel filter \(S_{k}\) to the predicted probabilistic map \(\varvec{p}^{m}\) for foreground class m.
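The composite loss could be implemented roughly as follows (a sketch; the probability clamp for numerical stability and the mean reduction over pixels in the cross entropy are our assumptions):

```python
import torch
import torch.nn.functional as F

# 3x3 Sobel kernels for gradients along the two directions, shape (2, 1, 3, 3).
SOBEL = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]],
                      [[[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]]])

def seg_loss(p, y_onehot, class_weights, lam=0.5):
    """L_seg = L_wce + lambda * L_edge.

    p:         softmax probabilities, (B, 4, H, W), classes [BG, LV, MYO, RV]
    y_onehot:  one-hot ground truth,  (B, 4, H, W)
    class_weights: e.g. torch.tensor([0.2, 0.25, 0.3, 0.25])
    """
    # Weighted cross entropy: a higher MYO weight counters class imbalance.
    w = class_weights.view(1, -1, 1, 1).to(p.device)
    l_wce = -(w * y_onehot * torch.log(p.clamp_min(1e-7))).sum(dim=1).mean()

    # Edge loss over the three foreground classes (channels 1..3): l2
    # distance between Sobel edge maps of prediction and ground truth.
    sobel = SOBEL.to(p.device)
    l_edge = sum(torch.norm(F.conv2d(p[:, m:m+1], sobel, padding=1)
                            - F.conv2d(y_onehot[:, m:m+1], sobel, padding=1),
                            p=2)
                 for m in range(1, 4))

    return l_wce + lam * l_edge
```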

Using the edge loss together with the weighted cross entropy encourages the network to focus more on the contours of the three structures, especially the myocardium, which are usually more difficult to delineate. In our experiments, we set \(\lambda = 0.5\) to balance the contributions of the two losses.

2.3 Post-processing

At inference time, each slice of a previously unseen LGE stack is fed to the cascaded network to obtain the probabilistic maps for the four classes. A dense conditional random field (CRF) [12] is then applied to refine the predicted 2D segmentation mask slice by slice. After that, 3D morphological dilation and erosion operations are applied to the whole segmentation stack to further improve its global smoothness. In particular, we perform these operations in a hierarchical order: first on the binary map covering all three structures, then on the MYO and LV labels separately.
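The hierarchical morphological stage could be realized as below (a sketch; the CRF refinement is omitted, and the iteration count as well as the rule assigning the remaining heart voxels to RV are assumptions on our part):

```python
import numpy as np
from scipy import ndimage

def smooth_labels(seg, iterations=1):
    """Hierarchical 3D smoothing of a predicted label stack.
    Label convention assumed here: 0 = BG, 1 = LV, 2 = MYO, 3 = RV."""
    def close3d(mask):
        # Dilation followed by erosion fills small gaps and smooths borders.
        mask = ndimage.binary_dilation(mask, iterations=iterations)
        return ndimage.binary_erosion(mask, iterations=iterations)

    heart = close3d(seg > 0)            # first: union of all three structures
    lv = close3d(seg == 1) & heart      # then LV ...
    myo = close3d(seg == 2) & heart     # ... and MYO, separately
    out = np.zeros_like(seg)
    out[heart] = 3                      # remaining heart voxels default to RV
    out[lv] = 1
    out[myo] = 2
    return out
```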

3 Experiments and Results

3.1 Data

The framework was trained and evaluated on the Multi-sequence Cardiac MR Segmentation Challenge (MS-CMRSeg 2019) dataset. We used a subset of 40 bSSFP and 40 LGE images to train the image translation network. We then created a synthetic dataset by applying the learned translation network to 30 labelled bSSFP images. Specifically, for each bSSFP image, we randomly sampled the style code from \({\mathcal {{N}}}(0, I)\) five times, resulting in a set of 150 synthetic LGE images in total. This synthetic dataset and the original 30 bSSFP images with corresponding labels formed the training set for the segmentation network. Exemplar synthetic LGE images are provided in the supplemental material. For validation, we used a subset of 5 annotated LGE images provided by the challenge organizers.

3.2 Implementation Details

Image Preprocessing. To deal with the varying image sizes and heterogeneous pixel spacings of the different imaging modalities, all images were resampled to a pixel spacing of \(1.25\,\mathrm{mm} \times 1.25\,\mathrm{mm}\) and then cropped to \(192 \times 192 \) pixels, with the heart roughly at the center of each image. This spatial normalization reduces the computational cost and task complexity of the subsequent image translation and segmentation training, making the networks focus on the relevant regions. To identify the heart, we trained a U-net-based localization network on the 30 annotated bSSFP images in the training set to produce rough segmentations of the three structures. The localization network employs instance normalization layers, which perform style normalization [13] and encourage invariance to image style changes (e.g. image contrast). As a result, the network is able to produce coarse masks localizing the heart on all bSSFP images and most LGE images, even though it was trained on bSSFP images only. Since this network might fail to locate the heart on certain LGE slices, we summed the segmentation masks across slices in each volume and then cropped according to the center of the aggregated mask. After cropping, each image was intensity normalized.
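The resampling and cropping could look as follows (a sketch; the heart center is assumed to come from the aggregated localization masks described above, and the zero-mean, unit-variance normalization is our assumption since the paper does not specify the scheme):

```python
import numpy as np
from scipy import ndimage

def preprocess_slice(img, spacing, center, target_spacing=1.25, size=192):
    """Resample one 2D slice to 1.25 mm x 1.25 mm and crop a 192 x 192
    patch around the estimated heart center (row, col)."""
    zoom = (spacing[0] / target_spacing, spacing[1] / target_spacing)
    img = ndimage.zoom(img, zoom, order=1)              # bilinear resampling
    cy, cx = int(center[0] * zoom[0]), int(center[1] * zoom[1])
    half = size // 2
    img = np.pad(img, half, mode="constant")            # avoid out-of-bounds
    patch = img[cy:cy + size, cx:cx + size]             # centered after pad
    return (patch - patch.mean()) / (patch.std() + 1e-8)
```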

Network Training. (1) For the image translation network, we used the official implementation of [7]. The network configuration and hyper-parameters were kept the same as in [7], except that the input and output images are 2D and single-channel. It was trained for 20k iterations with a batch size of 1. (2) For the segmentation network, we first trained the first U-net on the labelled bSSFP images and then fine-tuned it on the synthetic LGE images. This procedure was repeated to train the second U-net, with the parameters of the first U-net kept fixed. Both networks were optimized with the composite loss \(\mathcal {L}_{seg}\) using the Adam optimizer. The learning rate was initially set to 0.001 and was then decreased to \(1 \times 10^{-5}\) for fine-tuning. The weights for BG, LV, MYO, and RV in \(\mathcal {L}_{wce}\) were empirically set to 0.2 : 0.25 : 0.3 : 0.25. During training, we applied data augmentation on the fly; specifically, elastic deformations, random scaling, random rotations and gamma augmentation [14] were used. The algorithm was implemented in Python and PyTorch and was trained for 1000 epochs in total on an NVIDIA\(^{\tiny {\textregistered }}\) Tesla P40 GPU.
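The two-phase optimization could be written roughly as follows (a sketch; how the 1000 total epochs are split between pre-training and fine-tuning is not stated in the paper, so the 500/500 split is a placeholder):

```python
import torch

def train_segmenter(unet, bssfp_loader, synth_lge_loader, seg_loss_fn):
    """Pre-train on labelled bSSFP data, then fine-tune on synthetic LGE
    data with a reduced learning rate."""
    phases = [(1e-3, bssfp_loader, 500),       # pre-training
              (1e-5, synth_lge_loader, 500)]   # fine-tuning (placeholder split)
    for lr, loader, epochs in phases:
        opt = torch.optim.Adam(unet.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y_onehot in loader:
                loss = seg_loss_fn(torch.softmax(unet(x), dim=1), y_onehot)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return unet
```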

3.3 Results

To evaluate segmentation accuracy, the Dice metric and the average surface distance (ASD) between the automatic and the corresponding manual segmentation were calculated for each volume.
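For reference, the Dice score for one structure can be computed as below (a sketch; ASD requires extracting the two boundary surfaces and averaging their distances, which is typically done with a library such as medpy and is omitted here):

```python
import numpy as np

def dice(pred, gt, label):
    """Dice overlap between the binary masks of one structure."""
    p, g = pred == label, gt == label
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom > 0 else np.nan
```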

We compare the proposed method with two baseline methods: (1) a registration-based method and (2) a single U-net. For the registration-based method, each LGE segmentation was obtained by directly registering the corresponding bSSFP labels to the LGE image using the MIRTK toolkit. The transformation was estimated by mutual information-based registration (Rigid+Affine+FFD) between the two images. For the U-net, we trained it under two settings: (a) U-net: trained on labelled bSSFP images only; (b) U-net with fine-tuning (FT): trained on labelled bSSFP images and then fine-tuned on the synthetic LGE data, i.e. the same training procedure as the proposed method. Quantitative and qualitative results are shown in Table 1 and Fig. 4.

While the registration-based method (MIRTK) outperforms the U-net (see rows 1 and 2 in Table 1), it still fails to produce accurate segmentations of the myocardium (see the italic number in row 1), indicating the limitation of this registration-based approach. By contrast, the neural network-based methods (rows 3-5) fine-tuned on the synthetic LGE dataset significantly improve segmentation accuracy, increasing the Dice score for MYO by \({\sim }15\%\). This improvement demonstrates that the learned translation network is capable of generating realistic LGE images while preserving the domain-invariant structural information needed to optimize the segmentation network. In particular, compared to U-net (FT), the proposed Cascaded U-net (FT) achieves more accurate segmentation, with improvements in terms of both Dice and ASD (see bold numbers). The model even produces robust segmentation results on the challenging apical and basal slices (see the last column in Fig. 4). This demonstrates the benefit of integrating high-level shape knowledge and low-level image appearance to guide the segmentation procedure. In addition, the proposed post-processing further refines the segmentation results through smoothing, reducing the average ASD from 1.37 mm to 1.26 mm (see the last row in Table 1).

Table 1. Dice scores and ASD (mm) of the proposed segmentation method (Cascaded U-net) and the baseline methods on the validation set. Bold numbers indicate the best scores among the methods before post-processing (PP), whereas italic numbers indicate mean Dice scores below 0.700. FT: fine-tuning on the synthetic LGE dataset. N/A means that the ASD value could not be calculated due to missing predictions for that cardiac structure.
Fig. 4. Segmentation results of the proposed Cascaded U-net and the baseline approaches. Our proposed method (right-most column) produces more anatomically plausible segmentations, greatly outperforming the baseline methods, especially in the challenging cases: the apical (top row) and basal (bottom row) slices.

Finally, we applied ensemble learning to improve our model's performance in the test phase. Specifically, we trained the proposed segmentation network multiple times, each time regenerating a new synthetic LGE dataset for fine-tuning; four models were trained in total. The final submission result for each test image was obtained by averaging the probabilistic maps of these models and then assigning to each pixel the class with the highest score. In the testing stage of the competition, the method achieved very promising segmentation performance on a relatively large test set (40 subjects), with average Dice scores of 0.92 for LV, 0.83 for MYO, and 0.88 for RV, and ASDs of 1.66 mm for LV, 1.76 mm for MYO, and 2.16 mm for RV.
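The ensembling step simply averages the four models' probability maps before taking the per-pixel argmax, e.g.:

```python
import torch

def ensemble_predict(models, x):
    """Average the probabilistic maps of the independently trained models
    (each returning (B, 4, H, W) class probabilities), then assign each
    pixel the class with the highest mean score."""
    with torch.no_grad():
        probs = torch.stack([m(x) for m in models]).mean(dim=0)
    return probs.argmax(dim=1)   # (B, H, W) label map
```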

4 Conclusion

In this paper, we showed that synthesizing multi-modal LGE images from labelled bSSFP images to fine-tune a pre-trained segmentation network yields impressive segmentation performance on LGE images, even though the network has never seen real labelled LGE images. We also demonstrated that the proposed segmentation network (Cascaded U-net) outperformed the baseline methods by a significant margin, suggesting the benefit of integrating high-level shape knowledge and low-level image appearance to guide the segmentation procedure. More importantly, our cascaded segmentation network is independent of the particular architecture of the underlying convolutional neural networks. In other words, the basic network (U-net) in our work can be replaced with any state-of-the-art segmentation network to potentially improve prediction accuracy and robustness. Moreover, the proposed solution based on unsupervised multi-modal style transfer is not limited to cardiac image segmentation but can be extended to other multi-modal image analysis tasks where manual annotations for one modality are not available. Future work will focus on applying the method to problems such as domain adaptation for multi-modality brain segmentation.

Supplemental Material

See Fig. 5.

Fig. 5. Exemplar synthetic LGE images generated from bSSFP images using the multi-modal image translation network. Given one bSSFP image (column 1), the translation network translates it into multi-modal LGE-like images (columns 2 to 4). These translated images differ in brightness, contrast and the intensity distribution in the cardiac region, while preserving the same cardiac anatomy. Together with the annotations of the original bSSFP images (last column), these synthetic images form the synthetic dataset used to fine-tune the proposed segmentation network.