
1 Introduction

Atrial fibrillation (AF) is the most common cardiac arrhythmia observed in clinical practice, affecting up to 2\(\%\) of the population, with prevalence rising rapidly with advancing age [10]. Recently, late gadolinium enhancement magnetic resonance imaging (LGE MRI) has emerged as a promising and reliable technique to visualize and quantify left atrial (LA) scars [11]. The segmentation and quantification of the LA and scars provide important information for the clinical diagnosis and treatment of AF patients. Since manual delineation of the LA and scars is time-consuming and subjective, it is crucial to develop techniques for automatic segmentation of the LA cavity and scars from LGE MRI.

However, the poor image quality of LGE MRI, the variable shape of the LA, surrounding enhanced noise, and the complex patterns of scars make it challenging to segment the LA and scars automatically and accurately. Li et al. [2] reviewed algorithms for LA cavity and scar segmentation or quantification from medical images. Among them, deep learning-based methods dominate both tasks and have achieved promising results [1, 15, 16, 19,20,21,22]. Nevertheless, most of the methods reviewed in [2] solve the two tasks independently and ignore the intrinsic spatial relationship between the LA and scars, which are located on the LA wall, as Fig. 1 shows. The performance of LA cavity and scar segmentation may therefore be bottlenecked by the failure to exploit the correlation between the two tasks. Multi-task learning, which leverages the relationship between related tasks, has been shown to outperform methods that treat such tasks separately. Recently, Li et al. [1] developed a novel framework in which LA segmentation, scar projection onto the LA surface, and scar quantification are performed simultaneously in an end-to-end manner; explicitly exploiting the relationship between LA segmentation and scar quantification yielded significant performance improvements for both tasks.

Fig. 1. Examples of axial views from two cases in the LAScarQS2022 dataset. The LA cavity and scar are highlighted in blue and red, respectively. One can see that scars are located on the LA wall. (Color figure online)

Inspired by [1], this paper proposes a coarse-to-fine framework for joint segmentation of the LA and scars. In the coarse stage, a vanilla 3D U-Net [14] is trained to coarsely segment the LA and crop a region of interest (ROI) that contains the whole LA. In the fine stage, a modified dual-task learning 3D U-Net with two decoders, for LA and scar segmentation respectively, segments the LA and scars simultaneously. We also introduce an edge-enhanced feature-guided module (EFGM) at the skip connections between the shared encoder and the decoder layers for scar segmentation. It comprises a difference convolution submodule based on 3D central difference convolution (CDC) [7], followed by a spatial attention submodule. We argue that it passes edge-enhanced features that guide the localization and segmentation of scars, which lie on the LA wall, thereby exploiting the spatial relationship between the LA and scars. In addition, a dilated inception module (DIM) that extracts multi-scale features is plugged into the bottleneck of the modified 3D U-Net.

2 Methods

Figure 2 shows the pipeline of our two-stage, coarse-to-fine joint segmentation framework. In the coarse stage, a vanilla 3D U-Net is first trained to segment the ROI that contains the whole LA from the full 3D volume of each MRI. Once the ROIs are detected, fixed-size regions are cropped from the processed MRIs and fed into the proposed modified multi-task learning 3D U-Net, which segments the LA cavity and scars simultaneously in the fine stage.

Fig. 2. The overall pipeline of our coarse-to-fine framework for joint segmentation of the left atrium and scars. Network 1 is a vanilla 3D U-Net that coarsely segments the ROI. Network 2 is a modified 3D U-Net consisting of two decoders, for LA and scar segmentation respectively, that produces the refined segmentation results.

2.1 Coarse Segmentation of ROIs

As shown in Fig. 1, the LA cavity and scar regions occupy only a small part of the whole volume, the scar especially so. Therefore, we first employ a coarse segmentation stage to segment the ROI containing the LA cavity and scars, aiming to alleviate the class imbalance problem and discard redundant or irrelevant surrounding voxels. We choose the vanilla 3D U-Net as the coarse segmentation network for its effectiveness in various medical image segmentation tasks without any complex design.

Fig. 3. (a) An overview of our proposed modified 3D U-Net with two decoders for LA and scar segmentation, respectively. (b) Edge-enhanced feature-guided module using 3D central difference convolution. (c) Dilated inception module using dilated convolutions with different rates and shortcut connections.

2.2 Fine and Joint Segmentation of LA and Scars

Most automatic scar segmentation or quantification methods require an accurate initial LA segmentation, reflecting the prior knowledge that atrial scars are located on the LA wall. Moreover, previous methods usually solved the two tasks independently and ignored the intrinsic spatial relationship between the LA and scars [2]. Therefore, in the fine stage we propose a modified 3D U-Net consisting of two decoders for LA and scar segmentation and train it in a multi-task learning manner. Figure 3 (a) provides an overview of the proposed dual-task learning network architecture. First, an edge-enhanced feature-guided module (EFGM) is introduced at the skip connections between the shared encoder and the decoder for scar segmentation. Unlike the original skip connection, the EFGM, which can serve as an edge detector, helps preserve differential or edge-related information by extracting edge-enhanced features and passing them to the corresponding layers of the scar-segmentation decoder. In addition, a dilated inception module (DIM) is introduced at the end of the original encoder. Equipped with the DIM, the modified 3D U-Net can capture deep multi-scale semantic features, which benefits the joint segmentation of the LA cavity and scars, as the two targets differ greatly in size. The details of the EFGM and the DIM are described below.

Edge-enhanced Feature Guided Module. Difference convolution, which explicitly computes pixel differences during convolution to aggregate local gradient information, has in recent years been increasingly used in computer vision tasks such as edge detection [6], face recognition [5], and gesture recognition [7]. By contrast, vanilla convolution aggregates intensity-level information [6]. As a result, although modern CNNs based on vanilla convolution are powerful enough to learn rich, hierarchical image representations, they still struggle to focus on edge-related features because they lack an explicit encoding of gradient information [5]. Vanilla convolution and difference convolution can be formulated as follows (taking 2D convolution as an example):

$$\begin{aligned} y = \sum \nolimits _{i=1}^{k \times k} w_{i} \cdot x_{i} \qquad \text {(vanilla convolution)} \end{aligned}$$
(1)
$$\begin{aligned} y = \sum \nolimits _{x_{i}, x_{j} \in S} w_{i} \cdot (x_{i} - x_{j}) \qquad \text {(difference convolution)} \end{aligned}$$
(2)

where \(x_{i}\) and \(x_{j}\) are input pixels, \(w_{i}\) is the weight in the \(k \times k\) convolution kernel, and \(S\) is the local receptive field over the feature map.

As mentioned above, scars are located on the LA wall, so we argue that the edge information of the LA cavity is important for localizing and then segmenting scars. In the vanilla U-Net [13], long skip connections pass features from the encoder path to the decoder path to recover spatial information lost during downsampling. However, low-level features passed unchanged through the skip connections and fused with high-level features may contain substantial redundant location or spatial information. Motivated by these observations, we equip only the skip connections between the encoder and the scar-segmentation decoder of the modified 3D U-Net with the EFGM, so that what is passed are edge-enhanced features rich in edge-related information. With the EFGM, and owing to the ability of difference convolution to extract local differential information from feature maps, our model learns to suppress irrelevant regions and highlight the salient regions (the edge of the LA cavity) that are useful for precise localization and segmentation of scars. Moreover, the edge-enhanced features act as localization guidance when decoding high-level semantic features in the scar-segmentation decoder, which benefits the segmentation of scars located on the LA wall.

Figure 3 (b) illustrates the edge-enhanced feature-guided module. Each module mainly comprises a difference convolution submodule, in which we utilize a 3D central difference convolution (CDC) [7], formulated as follows:

$$\begin{aligned} y = \sum \nolimits _{i\in C} w_{i} \cdot (x_{i} - x_{0}) \end{aligned}$$
(3)

A 3D convolution with kernel size 3\(\times \)3\(\times \)3 and dilation 1 is used for demonstration. The local receptive field cube for this convolution is \(C=\{(-1, -1, -1), (-1, -1, 0), \ldots , (0, 1, 1), (1, 1, 1)\}\), and \(x_{0}\) denotes the center voxel of the cube.
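Expanding Eq. 3 gives \(y = \sum _{i} w_{i} \cdot x_{i} - x_{0} \sum _{i} w_{i}\), i.e., a vanilla convolution minus a 1\(\times \)1\(\times \)1 convolution whose weights are the kernel-wise sums of the original weights. A minimal PyTorch sketch of this identity follows; the class name and interface are illustrative, not taken from the paper's code:

```python
import torch.nn as nn
import torch.nn.functional as F

class CDC3d(nn.Module):
    """3D central difference convolution (Eq. 3): every kernel tap sees the
    difference between a neighbouring voxel and the centre voxel x_0.
    Implemented as a vanilla convolution minus a 1x1x1 convolution whose
    weights are the kernel-wise sums of the original weights."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, groups=1):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=padding, groups=groups, bias=False)
        self.groups = groups

    def forward(self, x):
        out = self.conv(x)  # vanilla term: sum_i w_i * x_i
        # centre term: x_0 * sum_i w_i, folded into a 1x1x1 convolution
        w_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        return out - F.conv3d(x, w_sum, groups=self.groups)
```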

Following [6], we use a separable depth-wise convolutional structure with a shortcut for fast inference and easy training. The residual path of the module comprises, in sequence, a depth-wise convolutional layer, a ReLU layer, and a point-wise convolutional layer. To further highlight edge-related features and filter out background noise, we apply a spatial attention mechanism at the end of the difference convolution submodule.
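Building on the CDC3d sketch above, the whole module might be assembled as follows. The paper does not detail the exact spatial attention used, so the CBAM-style channel-pooled variant here is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention3d(nn.Module):
    """Channel-pooled spatial attention; the exact attention form is not
    specified in the paper, so this CBAM-style variant is an assumption."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True)[0]], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class EFGM(nn.Module):
    """Sketch of the edge-enhanced feature-guided module: a residual path of
    depth-wise CDC -> ReLU -> point-wise conv, a shortcut, then spatial
    attention. Reuses the CDC3d sketch above."""

    def __init__(self, ch):
        super().__init__()
        self.dw_cdc = CDC3d(ch, ch, kernel_size=3, padding=1, groups=ch)  # depth-wise CDC
        self.pw = nn.Conv3d(ch, ch, kernel_size=1)                        # point-wise conv
        self.sa = SpatialAttention3d()

    def forward(self, x):
        y = x + self.pw(F.relu(self.dw_cdc(x)))  # separable residual path + shortcut
        return self.sa(y)                        # highlight edges, filter background
```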

Dilated Inception Module. Motivated by the Inception-ResNet-V2 module [8] and atrous spatial pyramid pooling (ASPP) [9], we propose the DIM to encode deep multi-scale features for both LA and scar segmentation. As shown in Fig. 3 (c), the DIM has four parallel paths, each consisting of a dilated convolution with a different dilation rate followed by a 1\(\,\times \,\)1 convolution. Finally, the original features are added to the four multi-scale feature maps through a shortcut connection. Different dilation rates enlarge the receptive fields of the parallel paths by inserting zeros between kernel elements, without increasing the number of parameters. As a result, the DIM can capture features of objects of widely differing sizes, such as the LA cavity and scars, by combining dilated convolutions with different rates.
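A possible realisation of the DIM is sketched below, under the assumption that each path is a dilated 3\(\times \)3\(\times \)3 convolution followed by a 1\(\times \)1\(\times \)1 convolution; the specific dilation rates are not given in the paper and are chosen here for illustration:

```python
import torch.nn as nn

class DIM(nn.Module):
    """Sketch of the dilated inception module: four parallel dilated 3x3x3
    convolutions, each followed by a 1x1x1 convolution, summed with the
    input through a shortcut. The dilation rates below are an assumption;
    the paper only states that the four rates differ."""

    def __init__(self, ch, rates=(1, 2, 3, 5)):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Sequential(
                # padding = rate keeps the spatial size for a 3x3x3 kernel
                nn.Conv3d(ch, ch, 3, padding=r, dilation=r),
                nn.ReLU(inplace=True),
                nn.Conv3d(ch, ch, 1),
            )
            for r in rates
        ])

    def forward(self, x):
        return x + sum(path(x) for path in self.paths)  # shortcut + four scales
```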

2.3 Loss Function

For the coarse stage, which involves only segmentation of the LA cavity, we use the Dice Loss.

For the fine and joint segmentation stage, the loss function is the sum of the LA segmentation loss and the scar segmentation loss, as shown in Eq. 4:

$$\begin{aligned} Loss_{total}=Loss_{LA}+Loss_{scar} \end{aligned}$$
(4)

The loss function used in LA segmentation is the sum of the Dice Loss and the Cross-Entropy Loss, as shown in Eq. 5:

$$\begin{aligned} Loss_{LA}=Loss_{ce}+Loss_{dice} \end{aligned}$$
(5)

For scar segmentation, as the scar takes up only a small fraction of the whole volume, which can cause a severe class-imbalance problem, the loss function is the sum of the Dice Loss and the Weighted Cross-Entropy Loss, as shown in Eq. 6:

$$\begin{aligned} Loss_{scar}=Loss_{wce}+Loss_{dice} \end{aligned}$$
(6)
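Equations 4-6 translate directly into a training loss. A hedged PyTorch sketch follows; the scar class weights are an assumption, since their values are not reported in the paper:

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss on the foreground channel (binary segmentation)."""
    prob = torch.softmax(logits, dim=1)[:, 1]
    target = target.float()
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def total_loss(la_logits, la_gt, scar_logits, scar_gt, scar_class_weights):
    """Eqs. 4-6: the LA branch uses Dice + cross-entropy (Eq. 5); the scar
    branch uses Dice + weighted cross-entropy (Eq. 6)."""
    loss_la = F.cross_entropy(la_logits, la_gt) + dice_loss(la_logits, la_gt)
    loss_scar = (F.cross_entropy(scar_logits, scar_gt, weight=scar_class_weights)
                 + dice_loss(scar_logits, scar_gt))
    return loss_la + loss_scar  # Eq. 4
```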

3 Experiments

3.1 Dataset and Data Preprocessing

The MICCAI 2022 LAScarQS2022 challenge (Left Atrial and Scar Quantification & Segmentation Challenge) [1,2,3] provides 194 LGE MRIs acquired in real clinical environments from patients suffering from AF, and comprises two tasks: (1) LA scar quantification and (2) left atrial segmentation from multi-center LGE MRIs. In this study, we focus on task 1.

The training dataset provided for task 1 of the LAScarQS 2022 challenge [1,2,3] consists of 60 LGE MRIs with segmentation annotations of the LA cavity and scars. In our experiments, the images and masks were first resampled to an isotropic resolution of \(1 \times 1 \times 1\,mm^3\). All volumes were then cropped and zero-padded to a uniform size of \(576 \times 576 \times 96\). We then applied a 3D version of contrast limited adaptive histogram equalization (CLAHE) [4] to enhance the contrast of the LGE MRIs, and finally performed sample-wise normalization.
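For concreteness, the preprocessing chain could look like the following sketch using SimpleITK and scikit-image; the helper name and parameter choices are illustrative, and the crop/zero-pad step to \(576 \times 576 \times 96\) is omitted for brevity:

```python
import numpy as np
import SimpleITK as sitk
from skimage import exposure

def preprocess(path):
    """Resample to 1 mm isotropic, apply 3D CLAHE, then sample-wise
    normalisation. A sketch under stated assumptions, not the authors'
    exact pipeline."""
    img = sitk.ReadImage(path)
    spacing, size = img.GetSpacing(), img.GetSize()
    new_size = [int(round(sz * sp)) for sz, sp in zip(size, spacing)]  # 1 mm voxels
    img = sitk.Resample(img, new_size, sitk.Transform(), sitk.sitkLinear,
                        img.GetOrigin(), (1.0, 1.0, 1.0), img.GetDirection(),
                        0.0, img.GetPixelID())
    vol = sitk.GetArrayFromImage(img).astype(np.float32)
    vol = (vol - vol.min()) / (vol.max() - vol.min() + 1e-8)  # CLAHE expects [0, 1]
    vol = exposure.equalize_adapthist(vol)                    # N-D CLAHE
    return (vol - vol.mean()) / (vol.std() + 1e-8)            # sample-wise normalisation
```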

3.2 Implementation Details

Our experiments were run on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. Due to memory restrictions, the input of the coarse segmentation network was first down-sampled from \(576 \times 576 \times 96\) to \(144 \times 144 \times 48\). The first network was trained for 100 epochs using the Adam optimizer with a fixed learning rate of 0.001 and a batch size of 4. We randomly chose 48 of the 60 MRIs as training data; the remaining 12 served as validation data. After training, the model with the best Dice score on the validation data was saved for ROI detection. For the fine and joint segmentation, we first computed the barycenter of the ground truth and cropped a region of size \(288 \times 192 \times 96\) centered at the barycenter from the original data. The cropped ROIs were then fed into the second network, which was trained for 100 epochs using the Adam optimizer with an initial learning rate of 0.001, reduced by a factor of 0.1 every 1000 iterations, and a batch size of 2. We randomly split the data into training (48 subjects) and testing (12 subjects) subsets for the fine stage.
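The fine-stage optimisation described above might be wired up as follows; `model`, `train_loader`, and `scar_class_weights` are placeholders assumed to exist, and `total_loss` refers to the loss sketch in Sect. 2.3:

```python
import torch

# Adam at 1e-3, decayed by 0.1 every 1000 iterations (sketch).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)

for epoch in range(100):
    for images, la_gt, scar_gt in train_loader:  # batch size 2
        optimizer.zero_grad()
        la_logits, scar_logits = model(images)   # dual-decoder forward pass
        loss = total_loss(la_logits, la_gt, scar_logits, scar_gt, scar_class_weights)
        loss.backward()
        optimizer.step()
        scheduler.step()  # per-iteration step realises "every 1000 iterations"
```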

To reduce the risk of over-fitting and further improve the generalization ability of our framework, we apply data augmentation, including random flipping and rotation, when training both networks, as sketched below.
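A minimal version of this augmentation, applied jointly to the image and both label volumes (the flip axes and the 90-degree in-plane rotations are assumptions, as the paper does not specify angles):

```python
import numpy as np

def augment(volume, la_mask, scar_mask):
    """Random flipping and rotation applied consistently to image and labels."""
    arrays = [volume, la_mask, scar_mask]
    if np.random.rand() < 0.5:                    # random flip along one axis
        axis = np.random.randint(3)
        arrays = [np.flip(a, axis) for a in arrays]
    if np.random.rand() < 0.5:                    # random in-plane rotation
        k = np.random.randint(1, 4)
        arrays = [np.rot90(a, k, axes=(1, 2)) for a in arrays]
    return [np.ascontiguousarray(a) for a in arrays]
```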

At the inference stage, each MRI volume from the testing subset was first down-sampled to \(144 \times 144 \times 48\) and fed into the first network, which output a binary mask used to locate the ROI. We computed the barycenter of the predicted mask, cropped a region of size \(288 \times 192 \times 96\) centered at this barycenter, and fed it into the second network. The second network simultaneously output the predicted masks of the LA cavity and scars inside the target region, which were then mapped back to the original volume size, completing the inference. The end-to-end segmentation takes approximately 9 s per case.
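The barycenter-centred cropping used in both training (on the ground truth) and inference (on the network-1 prediction) can be sketched as follows; clamping the window at the volume borders is our assumption:

```python
import numpy as np

def crop_roi(volume, mask, size=(288, 192, 96)):
    """Crop a fixed-size ROI centred at the barycenter of a binary mask,
    clamped to stay inside the volume."""
    coords = np.argwhere(mask > 0)
    center = coords.mean(axis=0).round().astype(int)        # barycenter
    starts = [int(np.clip(c - s // 2, 0, d - s))
              for c, s, d in zip(center, size, volume.shape)]
    slices = tuple(slice(st, st + s) for st, s in zip(starts, size))
    return volume[slices], slices  # the slices map predictions back later
```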

4 Results and Discussions

4.1 Ablation Experiments

We ran a series of ablation experiments to evaluate the effectiveness of multi-task learning and of the two proposed modules in our modified 3D U-Net. All experiments used the coarse-to-fine framework described above and shared the same coarse stage; only the fine-stage model was varied. Here, U-Net\(_{LA}\) denotes the vanilla 3D U-Net trained for LA segmentation alone, and U-Net\(_{scar}\) the vanilla 3D U-Net trained for scar segmentation alone. U-Net\(_{LA \; and \; scar}\) denotes the multi-task learning 3D U-Net consisting of a shared encoder and two decoders for joint segmentation of the LA and scars, which serves as our baseline model. We then successively tested the baseline model incorporating the DIM, the baseline model incorporating the EFGM, and the baseline model incorporating both modules. All experiments used the same training configuration and loss functions described above.

All models were evaluated through the validation platform provided by the LAScarQS2022 organizers. As shown in Table 1, LA segmentation performance was evaluated with the Dice score, average surface distance (ASD), and Hausdorff distance (HD). Scar quantification performance was evaluated by first projecting the segmentation result onto the manually segmented LA surface; the accuracy, specificity, and sensitivity of the two regions on the projected surface, together with the Dice score (Dice) and the generalized Dice score (\(Dice_{g}\)), were then used as indicators of scar quantification accuracy [1]. \(Dice_{g}\) is a weighted Dice score that evaluates the segmentation of all labels [17, 18] and is formulated as follows [1]:

$$\begin{aligned} Dice_{g}=\frac{2\sum \nolimits _{k=0}^{N_{k}-1}|S_{k}^{auto} \cap S_{k}^{manual}|}{\sum \nolimits _{k=0}^{N_{k}-1}\left( |S_{k}^{auto}| + |S_{k}^{manual}|\right) } \end{aligned}$$
(7)

where \(S_{k}^{auto}\) and \(S_{k}^{manual}\) denote the segmentation results of label \(k\) from the automatic method and the manual delineation, respectively, and \(N_{k}\) is the number of labels.
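For reference, Eq. 7 transcribes directly into NumPy (a sketch; label 0 is taken to be the background, and the function name is ours):

```python
import numpy as np

def generalized_dice(auto, manual, num_labels):
    """Generalized Dice score (Eq. 7): intersections and region sizes are
    summed over all labels before taking the ratio, so every label
    contributes in proportion to its size."""
    inter = sum(np.logical_and(auto == k, manual == k).sum()
                for k in range(num_labels))
    total = sum((auto == k).sum() + (manual == k).sum()
                for k in range(num_labels))
    return 2.0 * inter / total
```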

Table 1. Summary of the quantitative evaluation results of LA segmentation and scar quantification on the LAScarQS 2022 validation set in the ablation experiments. EFGM denotes the proposed edge-enhanced feature-guided module and DIM the proposed dilated inception module, both discussed in Sect. 2.2.

Table 1 presents the quantitative results for LA segmentation and scar quantification. Our baseline model outperforms U-Net\(_{LA}\) and U-Net\(_{scar}\), which treat the two related tasks separately, verifying the superiority of multi-task learning: the relationship between LA segmentation and scar segmentation is exploited implicitly. Figures 4 and 5 illustrate the segmentation results for the LA cavity and scars, respectively, in the ablation experiments. The boundary of the U-Net\(_{LA}\) results deviates clearly from that of the ground truth, and U-Net\(_{scar}\) tends to mislabel non-LA-wall regions and under-segment scars, whereas the baseline results are closer to the ground truth.

Meanwhile, Table 1 demonstrates the effectiveness of each proposed module. Compared to the baseline, incorporating the DIM reduces the HD of LA segmentation by about 2 mm and improves the scar Dice by around 2%. This implies the need for deep multi-scale features when segmenting targets of different sizes, since scars are tiny compared with the LA cavity. Notably, incorporating the EFGM improves the scar Dice by around 2.5% over the baseline and outperforms the baseline equipped only with the DIM. As shown in Fig. 5, introducing the EFGM alleviates the under-segmentation of scars observed in the other models. This indicates that edge-related information can effectively guide scar segmentation by encoding into the framework the prior spatial knowledge that scars lie on the LA wall, thus exploiting the spatial relationship between the LA and scars more explicitly. However, the model incorporating only the EFGM degrades slightly in LA Dice and ASD relative to the baseline. We attribute this to the EFGM being designed mainly for scar segmentation, which is much more challenging than LA segmentation, so it does not necessarily improve LA segmentation.

The highest performance gain (about 1.5% in LA Dice and 5% in scar Dice relative to U-Net\(_{LA}\) and U-Net\(_{scar}\)) is obtained when both the DIM and the EFGM are incorporated, and this model achieves the best segmentation performance on almost all metrics for both tasks. Figure 4 shows that its LA boundaries are the most consistent with the ground truth among all experiments, while Fig. 5 shows that it detects and segments scars more precisely than any other model in our ablation study; the combination of the two modules thus further improves the framework. Note that the model incorporating both modules outperforms the model with only the DIM on LA cavity segmentation, even though, as discussed above, the EFGM alone does not improve LA segmentation. This is probably because good scar segmentation can in turn boost LA segmentation during the joint optimization of multi-task learning.

Fig. 4. Visualization of the LA cavity segmentation results on the LAScarQS 2022 validation set for the different ablation configurations.

Fig. 5. Visualization of the scar segmentation results on the LAScarQS 2022 validation set for the different ablation configurations.

4.2 Comparison Experiments

We implemented U-Nets with different loss functions as comparison methods for both LA segmentation and scar segmentation, using the same hyper-parameters across these experiments for consistency.

Table 2 tabulates the quantitative comparison results for LA segmentation and scar quantification. For LA segmentation, our method achieves a Dice of 0.875, demonstrating its advantage in segmenting the LA cavity accurately. Meanwhile, the proposed coarse-to-fine joint segmentation framework obtains the smallest HD and ASD, indicating that it identifies the correct boundaries of LA cavities despite their varied shapes. Figure 6 likewise shows that our model achieves better segmentation than the other methods.

Note that our method shows significant improvement in scar quantification. As demonstrated in Fig. 7, the vanilla U-Net models tend to under-segment scars, whereas our method alleviates this problem. With the help of the DIM and the EFGM, edge-enhanced low-level features and multi-scale features are fused, integrating richer contextual semantics and more precise spatial information; this facilitates the segmentation of scars, which are hard to recognize and localize due to their small size, complex patterns, and surrounding noise.

Overall, our method outperforms the other methods, implying its effectiveness. This can be attributed to the two major contributions of our framework. First, the multi-task learning model effectively exploits the relationship between the LA and scars, and the EFGM and the DIM further boost the multi-task learning process by providing spatial guidance for segmenting scars and learning multi-scale representations. Second, the two-stage coarse-to-fine design suppresses the background voxels that vastly outnumber foreground voxels in scar segmentation, significantly mitigating the class imbalance problem.

Fig. 6. Visualization of the LA cavity segmentation results on the LAScarQS 2022 validation set compared with other classic methods.

Fig. 7. Visualization of the scar segmentation results on the LAScarQS 2022 validation set compared with other classic methods.

Table 2. Summary of the quantitative evaluation results of LA segmentation and scar quantification on the LAScarQS 2022 validation set in comparison experiments.

5 Conclusion

This paper proposes a coarse-to-fine framework for joint segmentation of the LA and scars from LGE MRI. The coarse segmentation network is a vanilla 3D U-Net that extracts the ROI from the volume, and the fine segmentation network is a modified 3D U-Net with two decoders, for LA and scar segmentation respectively, that segments the LA cavity and scars simultaneously in a multi-task learning manner. In addition, we introduce an edge-enhanced feature-guided module using 3D central difference convolution to exploit the spatial relationship between the LA and scars, and a dilated inception module to learn multi-scale semantic features in the modified 3D U-Net. We evaluated our method on the LAScarQS 2022 validation set, and the convincing results suggest the effectiveness of the proposed coarse-to-fine framework, especially for scar segmentation and quantification.