1 Introduction

Radiofrequency (RF) ablation is a common clinical technique for treating atrial fibrillation (AF) via electrical isolation. However, the success rate of some ablation procedures is low due to incomplete ablation patterns (gaps) on the left atrium (LA). Late gadolinium enhanced magnetic resonance imaging (LGE MRI) has become an important tool for detecting gaps in ablation lesions, which are located on the LA wall and pulmonary veins (PVs). Thus, it is important to segment the LA from LGE MRI for AF treatment. Manual delineation of the LA from LGE MRI can be subjective and labor-intensive, and automating this segmentation remains challenging.

In recent years, many algorithms have been proposed for automatic LA segmentation from medical images, but mostly for non-enhanced imaging modalities. In contrast, LA segmentation from LGE MRI to assist ablation procedures has received less attention. Most current studies on LA segmentation from LGE MRI still rely on time-consuming and error-prone manual segmentation [5, 12]. This is mainly because LA segmentation methods designed for non-enhanced imaging modalities are difficult to apply directly to LGE MRI, due to the presence of the contrast agent and the low-contrast boundaries. Therefore, conventional automated approaches for LA segmentation of LGE MRI generally require supporting information that is hard to obtain, such as shape priors [20] or additional MRI sequences [8]. Recently, with the development of deep learning (DL) in medical image computing, some DL-based algorithms have been proposed for automatic LA segmentation directly from LGE MRI [7, 15].

Fig. 1. Multi-center pre- and post-ablation LGE MRIs. The images differ in contrast, enhancement and background.

However, the generalization ability of DL-based models is limited, i.e., the performance of a model trained on a known domain (source domain) degrades drastically on an unseen domain (target domain). This is mainly due to domain shift, or distribution shift, which is common among data collected from different centers and vendors, as shown in Fig. 1. In the clinic, it is impractical to retrain a model each time data are collected from new vendors or centers. Therefore, improving model generalization ability is important to avoid the need for retraining. Current domain generalization (DG) methods can be categorized into three types: (1) domain-invariant feature learning approaches, such as disentangled representation [11]; (2) model-agnostic meta-learning algorithms, which optimize on meta-train and meta-test domains split from the available source domains [4]; (3) data augmentation strategies, which increase the diversity of the available data [1].

In this work, we investigate the generalization abilities of four commonly used segmentation models, i.e., U-Net [14], U-Net++ [19], DeepLab v3+ [3] and multi-scale attention network (MAnet) [2]. As Fig. 2 shows, we select two different sources of training data, i.e., the target domain (TD) and the source domains (SD), to evaluate the model generalization ability. In addition, we compare three different DG schemes for LA segmentation of multi-center LGE MRIs. The schemes include histogram matching (HM) [10], mutual information based disentangled (MID) representation [11], and random style transfer (RST) [9, 18].

Fig. 2. Illustration of the LA segmentation models for multi-center LGE MRIs.

2 Methodology

In this section, we describe the segmentation models we employ, formulate the DG problem (illustrated in Fig. 2) and describe the three investigated DG strategies.

2.1 Image Segmentation Models

All our segmentation models are supervised approaches based on convolutional neural networks. Typically, the models are trained using a training database \(\mathcal {T_\mathcal {D}}=\{(X_m, Y_m), m=1,\dots , M\}\) with images \(X\in \mathcal {D}\) from a single domain \(\mathcal {D}\) and corresponding labels Y. The segmentation model f(X) can be defined as,

$$\begin{aligned} f(X) \rightarrow Y, X \in \mathcal {D}, \end{aligned}$$
(1)

where \(X,Y \in \mathbb {R}^{1 \times H \times W}\) denote the image set and corresponding LA segmentation set.

We consider four commonly used segmentation models, all with an encoder-decoder architecture. The first model is a vanilla U-Net. U-Net++ is a modified version of the U-Net with a more complex decoder. DeepLab v3+ employs atrous convolutions with spatial pyramid pooling, and MAnet introduces multi-scale attention blocks.

2.2 Domain Generalization Models

The generalization ability of such models is limited, i.e., a model trained on a source domain \(\mathcal {D}\) might perform poorly for images \(X\notin \mathcal {D}\). DG strategies are therefore proposed to generalize models to unseen (target) domains. Given N source domains \(\mathcal {D}_s=\left\{ \mathcal {D}_{1}, \mathcal {D}_{2}, \cdots , \mathcal {D}_{N}\right\} \), we aim to construct a DG model \(f^{DG}(X)\),

$$\begin{aligned} f^{DG}(X) \rightarrow Y, X \in \mathcal {D}_s \cup \mathcal {D}_{t}, \end{aligned}$$
(2)

where \(\mathcal {D}_{t}\) are unknown target domains.

We investigate three DG strategies for LA segmentation from LGE MRI. In the first and simplest approach, HM is performed on the images from the target domain to match their intensity histograms to that of the source domains. The model and training process do not change. The second (MID-Net [11]) and third (RST-Net [9, 18]) methods are state-of-the-art methods that employ different approaches to achieve DG. In MID-Net, domain-invariant features are extracted by mutual information based disentanglement in the latent space, while in RST-Net the available domains are augmented with pseudo-novel domains.
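The HM step can be illustrated with a minimal NumPy sketch based on quantile mapping between empirical cumulative distribution functions. This is not taken from the cited method [10]: the function name `match_histogram`, the use of a single pooled array of source-domain intensities as the reference, and the synthetic intensities in the usage lines are all assumptions made for illustration.

```python
import numpy as np


def match_histogram(target: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Map intensities of `target` so that its empirical CDF follows that of `reference`.

    `target` is assumed to be a target-domain image and `reference` an array of
    source-domain intensities (hypothetical choice; the paper does not specify it).
    """
    # Unique target intensities, the index of each pixel among them, and their counts.
    t_values, t_inverse, t_counts = np.unique(
        target.ravel(), return_inverse=True, return_counts=True
    )
    r_values, r_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical cumulative distribution functions, both ending at 1.0.
    t_cdf = np.cumsum(t_counts) / target.size
    r_cdf = np.cumsum(r_counts) / reference.size

    # For each target quantile, look up the reference intensity at the same quantile.
    mapped_values = np.interp(t_cdf, r_cdf, r_values)
    return mapped_values[t_inverse].reshape(target.shape)


# Usage with synthetic intensities standing in for two centers.
rng = np.random.default_rng(0)
source_like = rng.normal(100.0, 20.0, size=(32, 32))
target_like = rng.normal(10.0, 2.0, size=(32, 32))
matched = match_histogram(target_like, source_like)
```

Because the mapping is monotonic, the rank order of target intensities is preserved and only their distribution changes, which is why the segmentation model and its training do not need to be modified.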

3 Materials

3.1 Data Acquisition and Pre-processing

LGE MRIs with various image qualities, types and imaging parameters were collected from three centers, as Table 1 shows. The centers are Utah School of Medicine (Center 1), Beth Israel Deaconess Medical Center (Center 2), and Imaging Sciences at King’s College London (Center 3). The datasets were selected from two public challenges, i.e., the MICCAI 2018 Atrial Segmentation Challenge [17] and the ISBI 2012 Left Atrium Fibrosis and Scar Segmentation Challenge [13]. A total of 140 images were collected, acquired either pre- or post-ablation. The acquisition time of the pre-ablation scans varied slightly, from 1 to 7 days, whereas that of the post-ablation scans ranged from 1 to 27 months, depending on the imaging center.

The LGE MRIs from center 1, 2, 3 and 4 were reconstructed to 0.625 \(\times \) 0.625 \(\times \) 1.25 mm, (0.7–0.75) \(\times \) (0.7–0.75) \(\times \) 2 mm, 0.625 \(\times \) 0.625 \(\times \) 2 mm, and 0.625 \(\times \) 0.625 \(\times \) 1.25 mm, respectively. All 3D images were divided into 2D slices as network inputs and then cropped to a unified size of 192 \(\times \) 192 centered on the heart region, with intensity normalization via Z-score. Random rotation, random flip and Gaussian noise augmentation were applied during training. The data distribution in the subsequent experiments is presented in Table 2.
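The cropping and Z-score normalization of a single 2D slice can be sketched as follows. This sketch assumes that the heart center is already known as a (row, column) coordinate, that slices near the border are zero-padded, and that normalization is computed on the cropped patch; none of these details are specified in the text, and the function name `crop_and_normalize` is hypothetical.

```python
import numpy as np


def crop_and_normalize(slice_2d: np.ndarray, center: tuple, size: int = 192) -> np.ndarray:
    """Crop a size x size patch around `center` (row, col), then apply Z-score normalization."""
    half = size // 2
    # Zero-pad by `half` on every side so that crops near the border keep the full size.
    padded = np.pad(slice_2d.astype(np.float64), half, mode="constant")
    row, col = center  # coordinates in the unpadded slice
    # In padded coordinates, the window starting at (row, col) is centered
    # on the original pixel (row, col).
    patch = padded[row:row + size, col:col + size]
    # Z-score: zero mean and unit standard deviation; epsilon guards constant patches.
    return (patch - patch.mean()) / (patch.std() + 1e-8)


# Usage on a synthetic 256 x 256 slice with an assumed heart center.
demo_slice = np.arange(256 * 256, dtype=np.float64).reshape(256, 256)
demo_patch = crop_and_normalize(demo_slice, center=(128, 128))
```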

Table 1. Image acquisition parameters of the multi-center LGE MRIs.

3.2 Gold Standard and Evaluation

All the LGE MRIs were manually delineated by experts from the corresponding centers. The manual LA segmentations were regarded as the gold standard. For LA segmentation evaluation, the Dice score, average surface distance (ASD) and Hausdorff distance (HD) were applied. Each image from the three centers was assigned an image quality score by averaging the scores from two experts, mainly based on the visibility of enhancements and the existence of image artefacts (please see the Supplementary Material file).
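For reference, the three metrics can be sketched in NumPy for 2D binary masks. This is a simplified illustration rather than the evaluation code used in the experiments: distances are reported in pixel units instead of millimetres, surfaces are defined by 4-connectivity, both masks are assumed to be non-empty, and the function names are hypothetical.

```python
import numpy as np


def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|) for two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return float(2.0 * np.logical_and(pred, gt).sum() / denom) if denom > 0 else 1.0


def surface_points(mask: np.ndarray) -> np.ndarray:
    """Coordinates of foreground pixels with at least one 4-connected background neighbour."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    # A pixel is interior if all four of its direct neighbours are foreground.
    interior = padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    return np.argwhere(mask & ~interior)


def surface_distances(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """Return (ASD, HD) between the two mask surfaces, in pixel units."""
    p = surface_points(pred).astype(float)
    g = surface_points(gt).astype(float)
    # Pairwise Euclidean distances between all predicted and gold-standard surface points.
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=-1)
    p_to_g, g_to_p = d.min(axis=1), d.min(axis=0)
    asd = (p_to_g.sum() + g_to_p.sum()) / (len(p) + len(g))  # symmetric average
    hd = max(p_to_g.max(), g_to_p.max())  # symmetric maximum
    return float(asd), float(hd)


# Usage: two 10 x 10 squares, the prediction shifted by two columns.
gt_mask = np.zeros((32, 32), dtype=bool)
gt_mask[10:20, 10:20] = True
pred_mask = np.zeros((32, 32), dtype=bool)
pred_mask[10:20, 12:22] = True
demo_dice = dice_score(pred_mask, gt_mask)
demo_asd, demo_hd = surface_distances(pred_mask, gt_mask)
```

In this toy case the overlap is 80 of 100 pixels per mask, so the Dice score is 0.8, and the two-column shift gives an HD of 2 pixels. The example also shows why Dice and HD can rank models differently: Dice measures overlap, whereas HD is driven by the single worst boundary error.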

3.3 Implementation

The framework was implemented in PyTorch and run on a computer with a 2.20 GHz Intel(R) Xeon(R) E5-2630 v4 CPU and a GeForce GTX 1080 Ti GPU. We employed the released Segmentation Models [16] for the experiments. All four semantic segmentation models use efficientnet-b6 as the backbone. We used the Adam optimizer to update the network parameters. The initial learning rate was set to 5e−5 and multiplied by 0.95 every 10 epochs.
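The learning rate schedule described above is a step decay, which can be written in closed form. The sketch below is plain Python and the function name `learning_rate` is hypothetical; in PyTorch, the same schedule corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.95`, although the text does not state which scheduler class was actually used.

```python
def learning_rate(epoch: int, initial_lr: float = 5e-5, gamma: float = 0.95, step: int = 10) -> float:
    """Step decay: the rate is multiplied by `gamma` once every `step` epochs."""
    return initial_lr * gamma ** (epoch // step)


# Learning rates at a few epochs (epochs counted from 0).
schedule = {epoch: learning_rate(epoch) for epoch in (0, 9, 10, 25)}
```

With these values the rate stays at 5e−5 for epochs 0 to 9 and drops to 4.75e−5 at epoch 10.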

Table 2. The distribution of training dataset and test dataset of LGE MRI from the three centers (C-i: center i).

4 Experiment

4.1 Comparisons of Different Semantic Segmentation Networks

Table 3 summarizes the LA segmentation results in terms of Dice, ASD and HD for the four semantic segmentation models. One can see that the performance of all the segmentation models decreased when the target domain was not included in the training data. This indicates that the generalization capabilities of commonly used DL-based segmentation models are still very limited. In terms of Dice, the four models trained on the TD performed very similarly. However, DeepLab v3+ achieved significantly better ASD and HD than the other three models. This may be attributed to its atrous convolution and spatial pyramid pooling module, which encourage the network to learn more spatial information. When training on the SD, the performance decrease of DeepLab v3+ was smaller than that of the other three models. Therefore, in this work DeepLab v3+ is regarded as the baseline model, and we aim to improve its generalization ability using the investigated DG schemes.

Table 3. Performance of the four segmentation models on the multi-center LGE MRI for LA segmentation. The training and test data distributions follow Table 2, i.e., the test data are from the Center 1 dataset, while the training data are from the TD (C-1 dataset) and the SD (C-2, 3 datasets), respectively.

4.2 Comparisons of Post- and Pre-ablation LGE MRI

As Fig. 1 shows, pre- and post-ablation LGE MRIs can have highly variable tissue appearance. Several studies have already shown that the performance of LA scar segmentation and quantification varies between pre- and post-ablation LGE MRI [6]. This is mainly because, compared with post-ablation images, the scars on pre-ablation LGE MRIs are hard to distinguish even for experts. In contrast, to the best of our knowledge, no studies to date have compared LA segmentation performance on pre- and post-ablation LGE MRI. Here, we compare and analyze the LA segmentation performance of the four basic segmentation models on pre- and post-ablation LGE MRI.

Figure 3 presents the Dice and HD values obtained by the four models on the pre- and post-ablation images, separately. One can see that all four models suffered from an accuracy deterioration caused by the domain shift on both pre- and post-ablation LGE MRIs, which is consistent with the results in Table 3. In addition, the Dice obtained by the four models is similar on both pre- and post-ablation LGE MRIs, but DeepLab v3+ performed better in terms of HD, especially on pre-ablation data.

In summary, there is no evident performance difference between pre- and post-ablation data for the four models. However, the standard deviations of the Dice and HD values of the LA segmentation on the pre-ablation data are generally lower than those on the post-ablation images. This may indicate that the segmentation models are more robust on the pre-ablation data of the multi-center LGE MRIs.

Fig. 3. LA segmentation results of the four segmentation models with different sources of training data: (a) Dice of post-ablation cases; (b) Dice of pre-ablation cases; (c) HD of post-ablation cases; (d) HD of pre-ablation cases.

4.3 Comparisons of Different Generalization Models

Table 4 summarizes the results of the three DG schemes compared with the baseline DeepLab v3+ model trained on multi-source domains. One can see that all three tested generalization strategies improved on the baseline results. Among the three methods, the conventional histogram matching algorithm performed best. MID-Net and RST-Net obtained similar results in terms of Dice, but the ASD and HD of MID-Net were worse.

Figure 4 presents 2D visualization results of the four methods on post- and pre-ablation LGE MRI. In the post-ablation case, all three DG schemes could identify some PV regions missed by DeepLab v3+. Similarly, in the pre-ablation subject, MID-Net and RST-Net both mitigated the segmentation errors in the mitral valve (MV) area. This suggests that the employed DG methods worked for both the post- and pre-ablation cases.

Table 4. Performance of different generalization models trained on multi-source domains for LA segmentation.

Fig. 4. 2D visualization of the LA segmentation based on the three generalization models on the multi-center LGE MRIs. Here, the yellow arrows indicate the wrongly segmented LA regions, i.e., PV and MV. (Color figure online)

5 Conclusion

In this work, we first investigated the generalization abilities of different semantic segmentation models for LA segmentation from multi-center LGE MRIs. The results showed that the performance of all the commonly used segmentation models degraded dramatically on the unknown domain. This emphasizes the importance of developing deep models with effective inherent generalization abilities for processing LGE MRI data from different centers. We then introduced three DG strategies, all of which were able to alleviate the performance decrease. Our study found that, quite surprisingly, the simple histogram matching strategy was the most effective DG method for LA segmentation of multi-center LGE MRI data. This may indicate that there is still large scope for further algorithmic developments in DG. In future work, we will investigate the inherent differences between multi-center LGE MRIs and develop a targeted and effective DG strategy to address this problem. Moreover, we will further study the domain shift between post- and pre-ablation LGE MRI from the same center, and the label variations of LGE MRIs from different centers.