Introduction

The presence of nonviable scar tissue in the left ventricle (LV) has a crucial prognostic and therapeutic role. Indeed, it enables the assessment of LV remodeling, as well as of the patient’s cardiac dysfunction or mortality [1,2,3].

Scar-tissue presence is commonly identified using cardiac magnetic resonance with late gadolinium enhancement (CMR-LGE) [4]. CMR-LGE is performed within ten to twenty minutes after the intravenous administration of gadolinium, thus when gadolinium has already been washed out from healthy tissues. This results in hyperenhanced (HE) intensity areas where nonviable scar tissue is present [5].

In clinical practice, scar analysis from CMR-LGE images is performed qualitatively using the American Heart Association (AHA) 17-segment model [6], which provides conventional risk stratification [7]. In particular, the average transmural extent of HE areas is estimated within each segment (0%, 1–25%, 26–50%, 51–75%, 76–100%) to perform diagnosis. Guidelines also suggest comparing CMR-LGE images with cine and perfusion images (if the latter are obtained) to correctly categorize ischemia and viability. However, a quantitative approach to scar analysis would provide supplementary information to be exploited for diagnosis and follow-up evaluation [8].

Several methods have been proposed in literature for quantitative scar analysis that rely on automatic or semi-automatic scar segmentation, once the LV myocardial region, defined by its epicardial and endocardial borders, has been manually traced from CMR-LGE images [8].

Quantitative scar-analysis methods include two widely used threshold-based semi-automatic algorithms [7]. These two algorithms are semi-automatic because a region of interest has to be manually identified within the myocardial region to calculate the threshold values. The two thresholds are defined as: (1) the intensity value n standard deviations higher than the mean intensity of a user-defined region in the normal myocardium (nSD), and (2) half of the maximum intensity of a user-defined HE region [full width at half maximum (FWHM)]. In addition, other semi-automatic methods in the literature largely exploit pixel-intensity information for scar-tissue segmentation through thresholding (e.g., [9,10,11,12,13,14,15]). Popular automatic approaches also include clustering techniques, such as Gaussian mixture model (GMM) [16], fuzzy c-means [8], and superpixel segmentation [17, 18], where a superpixel is defined as a group of connected pixels with similar gray-level intensity and texture [19]. Max-flow and graph-cut optimization were explored in Refs. [20, 21], while level-set modeling was used in Ref. [22].
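The two threshold rules can be illustrated with a minimal Python sketch. The function names, the intensity values, the choice n = 2, and the use of the population standard deviation are assumptions made for illustration only; they are not taken from Ref. [7].

```python
from statistics import mean, pstdev


def nsd_threshold(normal_roi, n):
    """Threshold lying n standard deviations above the mean of a normal-myocardium ROI."""
    return mean(normal_roi) + n * pstdev(normal_roi)


def fwhm_threshold(he_roi):
    """Threshold at half of the maximum intensity of a hyperenhanced ROI."""
    return 0.5 * max(he_roi)


def segment(myocardium_pixels, threshold):
    """Label each myocardial pixel as scar (1) if its intensity exceeds the threshold."""
    return [1 if p > threshold else 0 for p in myocardium_pixels]


# Hypothetical intensities, invented for this example.
normal_roi = [10, 12, 11, 9, 8]       # user-defined region in normal myocardium
he_roi = [40, 55, 60, 48]             # user-defined hyperenhanced region
myocardium = [9, 11, 35, 58, 12, 29]  # pixels inside the traced myocardium

nsd_mask = segment(myocardium, nsd_threshold(normal_roi, n=2))
fwhm_mask = segment(myocardium, fwhm_threshold(he_roi))
```

In this toy case the pixel with intensity 29 is labeled as scar by the nSD rule but not by the FWHM rule, which shows how the two thresholds can disagree on borderline pixels.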

Although these methodologies achieved encouraging segmentation performance, they suffer from variability in CMR-LGE images (e.g., in terms of noise and intensity level associated with HE areas), and/or require heavy operator intervention. This strongly hampers the translation of the developed methodologies into actual clinical practice [22]. To tackle image variability and reduce operator intervention, the literature on medical-image segmentation is increasingly focusing on deep-learning (DL) approaches based on convolutional neural networks (CNNs) [23].

Despite the potential of DL for medical-image segmentation, few DL-based attempts at scar segmentation from CMR-LGE images can be found in the literature. These attempts, both focused on quantifying atrial fibrosis, include Refs. [18, 24], which exploit CNNs for superpixel-based feature extraction and sparse auto-encoders for superpixel classification (i.e., each superpixel is classified as scar tissue or healthy myocardial tissue). Additionally, in Ref. [25] CNN-based image features were extracted from square image patches; the features from each patch were classified with fully connected layers, and the output class was assigned to the central pixel of the patch.

It is worth noting that these DL-based methodologies classify superpixels or image patches individually. To encode spatial-connection information while performing segmentation, advancements in DL in other fields (such as natural-image segmentation, and, more recently, medical-image segmentation in other anatomical districts) have led to the introduction of fully convolutional neural networks (FCNNs). In an FCNN, fully connected layers are replaced by upsampling layers to directly provide fast and accurate image segmentation [26, 27]. In Ref. [28], simulated LV scar tissue generated from CMR-LGE images obtained in healthy subjects was segmented using U-Net [29]. In Ref. [30], a first attempt that exploits a recursive FCNN for the task of atrial-scar segmentation in clinical images was proposed, resulting in a sensitivity of 76.12%.

Considering the performance achieved by FCNNs in other fields, the aim of this paper was to investigate the feasibility and accuracy of FCNNs for LV scar segmentation in CMR-LGE images.

Fig. 1

Fully convolutional neural network (FCNN) architecture. Numbers refer to the FCNN-module indexes

Methods

Since the earliest attempts at using FCNNs for image segmentation [26], which mainly relied on successful CNN architectures modified and tuned to include and train upsampling layers, several FCNN models have been proposed (e.g., [31, 32]). In this work, the efficient neural network (ENet), which was presented in Ref. [32] for natural-image segmentation, was exploited and modified. With respect to the first FCNN architectures for image segmentation (e.g., [31]), ENet also implements skip connections [33]: architectures with skip connections have been widely shown to outperform their plain counterparts (e.g., [32, 33]). While providing accuracy comparable to existing FCNN models, ENet was also shown to train faster and to significantly reduce the number of required floating-point operations (FLOPs). Moreover, ENet has already been found to provide good results for ventricle segmentation in CMR images [29].

The ENet architecture consists of a sequence of 7 different stages (Fig. 1). The first stage, called the initial stage, consists of an inception module [34] that concatenates the results of convolutional layers with different receptive-field sizes (i.e., 5 × 5, 3 × 3, and 1 × 1) to allow a rich (multi-scale) feature representation. Compared with the original ENet, our FCNN initial stage had only 13 convolutional layers in parallel to a max-pooling layer, thus resulting in 14 feature maps after concatenation, instead of the original 16. Indeed, ENet was designed for RGB images, while our FCNN was fed with CMR-LGE images with one video-intensity channel only. Convolution was performed with 3 × 3 kernels with stride 2 and max pooling with non-overlapping 2 × 2 windows.
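The feature-map count of the initial stage follows from simple shape arithmetic, sketched below in Python. The function names and the convolution padding of 1 pixel are assumptions (the text does not state the padding); the kernel sizes, strides, and the 13 convolutional outputs are those given above.

```python
def output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def initial_stage_shape(height, width, in_channels=1, n_conv_maps=13):
    """Shape (height, width, channels) after concatenating the two parallel branches."""
    # Convolutional branch: 3 x 3 kernels, stride 2 (padding of 1 is an assumption).
    conv_hw = (output_size(height, 3, 2, 1), output_size(width, 3, 2, 1))
    # Pooling branch: non-overlapping 2 x 2 max pooling keeps the input channel count.
    pool_hw = (output_size(height, 2, 2, 0), output_size(width, 2, 2, 0))
    if conv_hw != pool_hw:
        raise ValueError("branches must share spatial size to be concatenated")
    return conv_hw[0], conv_hw[1], n_conv_maps + in_channels


single_channel = initial_stage_shape(64, 64, in_channels=1)  # CMR-LGE input
rgb = initial_stage_shape(64, 64, in_channels=3)             # original ENet input
```

With a one-channel input the pooling branch contributes a single map, giving 13 + 1 = 14 feature maps; with an RGB input it contributes three, giving the original 16.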

Fig. 2

Workflow of the first segmentation protocol (Protocol 1). GT ground truth, CMR-LGE cardiac magnetic resonance with late gadolinium enhancement, FCNN Fully convolutional neural network

Fig. 3

Workflow of the second segmentation protocol (Protocol 2). GT ground truth, CMR-LGE cardiac magnetic resonance with late gadolinium enhancement, FCNN Fully convolutional neural network. Myocardial masks are obtained, during both the training and testing phases, by the manually delineated myocardial contours

ENet stages 2 to 4 act as encoders for feature extraction and consist of a series of bottleneck modules. As in the original paper [34], in the implemented FCNN configuration each bottleneck module had a main branch and a lateral branch, the latter consisting of a series of three convolutional layers. The output of each of the three convolutional layers was activated with the parametric rectified linear unit (PReLU) [35]. Prior to activation, batch normalization was performed as a regularization technique. The main branch consisted of a max-pooling layer followed by padding to match the dimensions of the convolutional-filter output. The output maps from the two branches were summed and activated with PReLU. As shown in Fig. 1, stages 2, 3 and 4 were made of 5, 9 and 8 bottleneck modules, respectively.

Stages 5 and 6 of ENet also consist of a series of bottleneck modules (3 and 2, respectively), but these stages act as decoders performing upsampling. As in Ref. [35], here max unpooling and spatial convolution were present in the lateral and main branches of the bottleneck modules, respectively.

The last stage of the proposed FCNN consisted of a bare full convolution. The convolution kernel had height and width equal to the size of the CMR-LGE images, and 2 channels, as the problem addressed in this paper is a binary segmentation problem.

Segmentation protocols

After modifying the ENet architecture to deal with the scar-segmentation task, two segmentation protocols were investigated. The aim of the first protocol (Protocol 1) was to explore the potential of the presented FCNN to directly provide scar segmentation from CMR-LGE images. Thus, during training, the FCNN was fed with CMR-LGE images and the corresponding scar ground-truth (GT) masks. Scar-GT creation is explained in “Experimental setup”. The workflow of Protocol 1 is shown in Fig. 2.

The second protocol (Protocol 2) aimed to explore the potential of the FCNN in segmenting scar in a pre-defined LV myocardial region. With Protocol 2, the aim was to investigate whether delimiting the search area for scar segmentation, as currently done in the semi-automated quantification methods described in “Introduction”, could provide more accurate segmentation results. Thus, Protocol 2 required manual identification of the myocardial boundaries a priori to delimit the search area. To this end, prior to feeding the FCNN, CMR-LGE images were multiplied by the relevant binary myocardial masks (obtained, both for training and testing, through manual tracing of LV contours as explained in “Experimental setup”). The workflow of Protocol 2 is shown in Fig. 3.
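The masking step of Protocol 2 amounts to an element-wise product between the image and the binary myocardial mask. A minimal sketch is given below; the function name and the toy 3 × 3 values are hypothetical, and images are represented as nested lists purely to keep the example self-contained.

```python
def apply_myocardial_mask(image, mask):
    """Multiply an image by a binary mask, zeroing every pixel outside the myocardium."""
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        [pixel * keep for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Toy 3 x 3 example with invented intensities.
image = [[5, 7, 9],
         [4, 8, 6],
         [3, 2, 1]]
myocardial_mask = [[0, 1, 0],
                   [1, 1, 1],
                   [0, 1, 0]]

masked_image = apply_myocardial_mask(image, myocardial_mask)
```

Only the pixels inside the mask keep their intensity, so structures surrounding the LV cannot be mistaken for scar tissue by the network.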

Experimental setup

The CMR-LGE images analyzed in this study were acquired from 30 patients (26 men and 4 women) at the Centro Cardiologico Monzino hospital in Milan (Italy), for a total of 250 short-axis images. These patients were retrospectively selected from the hospital database, the inclusion criterion being a diagnosis of ischemic heart disease with consequent presence of nonviable scar tissue in the LV myocardium. Image size was 256 × 256 pixels and all the images were used for the analysis (i.e., no slice selection was performed).

Fig. 4

An example of the masks characterizing the scar presence as drawn by an expert cardiologist for three patients. Scar location and dimension vary from slice to slice and from patient to patient

For training and testing purposes, scar GT was obtained with manual tracing of scar contours by an expert cardiologist using Circle Cardiovascular Imaging v.5.6. LV-myocardium contours for Protocol 2 were obtained in the same way. Examples of CMR-LGE images and relevant ground truth resulting from scar manual tracing are shown in Fig. 4.

Data pre-processing was performed prior to FCNN training and testing. In particular, CMR-LGE images were cropped to reduce the processing area, as commonly suggested in the literature [22]. Image cropping was fully automatic. First, the LV diameter and center were retrieved from each CMR-LGE slice using the circle Hough transform [36]. Square cropping was then performed by centering the crop area on the LV center and setting the crop side length equal to twice the LV diameter. To standardize the cropped-image size, as LV size varied from patient to patient, all the images were resized to 64 × 64 pixels, i.e., the minimum crop size found. The 64 × 64 images were then standardized by subtracting the mean intensity of each image and dividing by its intensity standard deviation.
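The crop geometry and the intensity standardization can be sketched as follows. This is an illustration under stated assumptions: the LV center and diameter are taken as already estimated (the circle Hough transform and the resizing step are not reproduced), clamping the crop to the image borders is an assumption, and the population standard deviation is used because the text does not specify the estimator. All names are hypothetical.

```python
from statistics import mean, pstdev


def crop_box(center_row, center_col, lv_diameter, image_size=256):
    """Square crop centered on the LV with side equal to twice the LV diameter.

    Returns (top, left, bottom, right); clamping to the image is an assumption.
    """
    half_side = lv_diameter  # half of a side that is twice the diameter
    top = max(0, center_row - half_side)
    left = max(0, center_col - half_side)
    bottom = min(image_size, center_row + half_side)
    right = min(image_size, center_col + half_side)
    return top, left, bottom, right


def standardize(image):
    """Subtract the image mean and divide by the image standard deviation."""
    pixels = [p for row in image for p in row]
    mu, sigma = mean(pixels), pstdev(pixels)
    if sigma == 0:
        raise ValueError("constant image cannot be standardized")
    return [[(p - mu) / sigma for p in row] for row in image]


box = crop_box(center_row=128, center_col=120, lv_diameter=20)
standardized = standardize([[1, 2], [3, 4]])
```

After standardization every image has zero mean and unit standard deviation, which reduces the effect of patient-to-patient differences in overall intensity level.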

To test the proposed segmentation approaches, the CMR-LGE image dataset was divided into two sets: the former was used for training and validation and the latter for testing purpose only. Considering the relatively limited number of CMR-LGE images available (even though comparable with similar work in the literature [22]), leave-one-patient-out cross-validation was used for robust performance evaluation. Thus, images from one patient were classified using the FCNN trained with all the images from the remaining 29 patients. This procedure was repeated for all the 30 patients.
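Leave-one-patient-out cross-validation can be sketched as a generator that holds out all slices of one patient per fold. The function name, patient identifiers, and slice labels below are hypothetical placeholders.

```python
def leave_one_patient_out(slices_by_patient):
    """Yield (held-out patient, training slices, test slices) for every patient."""
    for test_patient in sorted(slices_by_patient):
        train = [
            s
            for patient, slices in slices_by_patient.items()
            if patient != test_patient
            for s in slices
        ]
        test = list(slices_by_patient[test_patient])
        yield test_patient, train, test


# Hypothetical toy dataset: three patients with a different number of slices each.
slices_by_patient = {
    "patient_01": ["p01_s1", "p01_s2"],
    "patient_02": ["p02_s1"],
    "patient_03": ["p03_s1", "p03_s2", "p03_s3"],
}

folds = list(leave_one_patient_out(slices_by_patient))
```

Splitting by patient rather than by slice ensures that no slice of the test patient is ever seen during training.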

Data augmentation was performed on the training set by applying the following 7 transformations: vertical flipping, horizontal flipping, a combination of both, 90° rotation, 90° rotation with vertical flipping, 90° rotation with horizontal flipping, and 90° rotation with both vertical and horizontal flipping.
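The seven transformations can be sketched on an image stored as a nested list. The function names and the clockwise rotation direction are assumptions; the text does not specify the direction.

```python
def flip_vertical(img):
    """Reverse the row order (top becomes bottom)."""
    return [list(row) for row in img[::-1]]


def flip_horizontal(img):
    """Reverse each row (left becomes right)."""
    return [list(row[::-1]) for row in img]


def rotate_90(img):
    """Rotate by 90 degrees clockwise (direction is an assumption)."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img):
    """Return the seven transformed copies of img listed in the text."""
    rotated = rotate_90(img)
    return [
        flip_vertical(img),
        flip_horizontal(img),
        flip_horizontal(flip_vertical(img)),
        rotated,
        flip_vertical(rotated),
        flip_horizontal(rotated),
        flip_horizontal(flip_vertical(rotated)),
    ]


image = [[1, 2],
         [3, 4]]
variants = augment(image)
```

For an image without symmetries, the seven variants and the original are all distinct, so each training image contributes eight samples; applied to the 250 available images this gives 250 × 8 = 2000.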

FCNN training

Mini-batch gradient descent was used for FCNN training (for both the segmentation protocols), using a batch size equal to 4. Mini-batch gradient descent was chosen as a compromise between gradient descent and stochastic gradient descent, to provide fast training convergence while limiting the memory usage [37].

Cross entropy was used as the loss function. Adaptive moment estimation (ADAM) [38], which adapts the learning rate by regularizing the gradient descent using both gradient amplitude and momentum, was used as the training optimizer. To set an upper bound on the learning rate during training, an exponentially decaying learning-rate bounding function was defined, using an initial learning rate equal to 5e−4. This proved useful especially during the last training epochs to further reduce the loss, as commonly recognized in the DL literature [39]. As a regularization technique, a weight decay equal to 2e−4 was imposed on the FCNN convolution layers. The learning rate and weight decay were set as in Ref. [32], where they were demonstrated to consistently provide the best segmentation results. The FCNNs for both protocols were trained for 100 epochs. The best model among epochs according to the Dice similarity coefficient (DSC) [40] was then chosen.
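An exponentially decaying learning-rate bound can be sketched as below. Only the initial value (5e−4) and the number of epochs (100) come from the text; the decay rate of 0.95 and the per-epoch (rather than per-step) update are hypothetical choices made for illustration.

```python
def learning_rate_bound(epoch, initial_lr=5e-4, decay_rate=0.95):
    """Upper bound on the learning rate at a given epoch.

    decay_rate is a hypothetical value; the text does not report it.
    """
    return initial_lr * decay_rate ** epoch


# Bound for each of the 100 training epochs.
schedule = [learning_rate_bound(epoch) for epoch in range(100)]
```

The bound starts at the initial learning rate and shrinks monotonically, so the updates in the last epochs are small.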

FCNN training and testing were implemented using TensorFlow. All tests were performed using an NVIDIA® GeForce® GTX 1050 (4 GB GDDR5 dedicated) on an Intel® Core® i7-7700HQ (2.8 GHz, 6 MB cache, 4 cores) computer with 16 GB DDR4-2400 SDRAM. FCNN training took \(\sim 30\) h for each of the two protocols.

Evaluation

Inspired by similar work in the literature for scar segmentation (e.g., [18, 24, 25]), the segmentation outcomes, obtained with both Protocol 1 and Protocol 2, were quantitatively evaluated with respect to the GT in terms of pixel classification accuracy (Acc), sensitivity (Se), and specificity (Sp):

$${\text {Acc}}= {\frac{{\text {TP}} + {\text {TN}}}{n}}$$
(1)
$$\begin{aligned} {\text {Se}}= {\frac{{\text {TP}}}{{\text {TP}} + {\text {FN}}}} \end{aligned}$$
(2)
$$\begin{aligned} {\text {Sp}}= {\frac{{\text {TN}}}{{\text {TN}} + {\text {FP}}}} \end{aligned}$$
(3)

where TP and TN are the numbers of scar and background pixels that were correctly identified, respectively, and n is the total number of pixels. FP and FN are the number of background pixels classified as scar tissue and the number of scar pixels classified as background, respectively. The DSC, representing an overlap measure, was also computed as:

$$\begin{aligned} {\text {DSC}} = \frac{2{\text {TP}}}{{\text {FP}} + {\text {FN}} + 2{\text {TP}}} \end{aligned}$$
(4)
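Equations (1) to (4) can be computed from a predicted and a GT binary mask as in the following sketch. The function names and the toy masks are hypothetical, n is taken as the total number of pixels, and returning None when a denominator is zero (for example, Se on a slice without scar) is an assumption, since the text does not state how such cases were handled.

```python
def confusion_counts(pred, gt):
    """Count TP, TN, FP, FN between two binary masks of equal shape."""
    tp = tn = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif not p and not g:
                tn += 1
            elif p:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def segmentation_metrics(pred, gt):
    """Acc, Se, Sp and DSC as defined in Eqs. (1) to (4)."""
    tp, tn, fp, fn = confusion_counts(pred, gt)
    n = tp + tn + fp + fn  # total number of pixels
    return {
        "Acc": ratio(tp + tn, n),
        "Se": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "DSC": ratio(2 * tp, fp + fn + 2 * tp),
    }


# Toy masks invented for this example (1 = scar, 0 = background).
gt_mask = [[1, 1, 0, 0],
           [0, 0, 0, 0]]
pred_mask = [[1, 0, 1, 0],
             [0, 0, 0, 0]]

metrics = segmentation_metrics(pred_mask, gt_mask)
```

In this toy case TP = 1, FN = 1, FP = 1 and TN = 5, so accuracy is high (0.75) while Se and DSC are only 0.5. Because TN does not appear in Eq. (4), DSC is less affected than Acc by the large background area.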

The Wilcoxon signed-rank test [significance level (\(\alpha\)) = 0.05] was used to assess whether significant differences existed in DSC between the segmentation results obtained with Protocol 1 and Protocol 2.
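For readers unfamiliar with the test, a simplified pure-Python sketch of a two-sided Wilcoxon signed-rank test is given below. It uses the normal approximation without tie or continuity correction, which is a simplification and not necessarily the variant used in this study; the paired DSC values are invented for illustration and are not results from this work.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are discarded and tied absolute differences receive
    average ranks. Returns the statistic W = min(W+, W-) and the p value.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p_value


# Invented paired DSC values (%), one pair per hypothetical sample.
dsc_protocol_1 = [54, 60, 48, 70, 55, 62, 40, 65, 58, 50]
dsc_protocol_2 = [55, 62, 51, 74, 60, 68, 47, 73, 67, 60]

w_statistic, p_value = wilcoxon_signed_rank(dsc_protocol_1, dsc_protocol_2)
significant = p_value < 0.05
```

The test is paired and non-parametric, so it compares the two protocols on the same samples without assuming normally distributed DSC values.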

Table 1 Median (inter-quartile range) performance measures obtained for the first (Protocol 1) and second (Protocol 2) segmentation protocol

Results

Although all patients had previously been diagnosed with myocardial fibrosis in the LV, scar tissue was present in only 215 slices out of 250 (86% of the slices). The scar area in each slice ranged between 20 and 1259 pixels (pixel resolution: 1.49 × 1.49 mm). The dataset showed high intra- and inter-patient variability in both scar size and location in the LV, as can be seen from the sample scar masks in Fig. 4.

Table 2 Median normalized contingency table for Protocol 1
Table 3 Median normalized contingency table for Protocol 2
Fig. 5

Sample segmentation results obtained with Protocol 1 (red contour) and Protocol 2 (green contour) for four patients. The blue contour refers to ground-truth segmentation

Table 1 shows the performance measures obtained with Protocol 1 and Protocol 2. The normalized contingency tables are shown in Table 2 (Protocol 1) and Table 3 (Protocol 2). With Protocol 1, median Se and DSC were 68.77% (IQR 34.83%) and 54.00% (IQR 41.03%), respectively. Protocol 2 outperformed Protocol 1 significantly (p value < 0.05), with median Se and DSC equal to 88.07% (IQR 18.84%) and 71.25% (IQR 31.82%), respectively.

Protocol 1 and Protocol 2 failed to detect the presence of scar tissue in 17 slices from 7 patients and 2 slices from 1 patient, respectively. Protocol 1 and Protocol 2 detected scar tissue when it was not present in 21 slices from 10 patients and 28 slices from 11 patients, respectively.

Some examples of the segmentation outcomes obtained with Protocol 1 and Protocol 2 for four patients, highlighting the observed scenarios, are shown in Fig. 5. In CMR-LGE slices from Patient 1 and Patient 2, the tendency of Protocol 1 (red line) to overestimate the GT scar contours (blue line) can be observed. In Patient 3, both Protocol 1 and Protocol 2 detected scar tissue in three out of eight slices where none had been identified by the expert cardiologist. A relevant example is shown in Fig. 5 bottom left. For Patient 4, Protocol 1 failed (or barely succeeded) in detecting the presence of scar tissue in four slices out of nine. Two examples are shown in Fig. 5 bottom right.

Discussion

In this paper, the feasibility and accuracy of FCNNs for scar segmentation in CMR-LGE images were assessed. Accordingly, ENet was modified to be applicable to this kind of image, and its performance was evaluated in two parallel segmentation protocols. In this pilot study, as only 30 patients were included, data augmentation made it possible to increase the total number of available images to 2000, and leave-one-patient-out cross-validation was the method of choice to guarantee proper analysis. The training time (about 30 h) could be considered acceptable, considering this approach and the use of non-optimized hardware. Both segmentation protocols were in general able to detect scar tissue in the CMR-LGE images. Nonetheless, the performance achieved with Protocol 1 was lower than that achieved with Protocol 2. This result was expected, as scar-segmentation algorithms in the literature (both traditional and DL-based) commonly require a priori knowledge of the LV myocardial position, defined by its myocardial borders, to define the search region for segmentation (e.g., [7, 8, 17, 18]). Indeed, in CMR-LGE images, several structures surround the LV. Such structures have intensity and texture similar to those of scar tissue, making the task of Protocol 1 more challenging than that of Protocol 2.

The results obtained with Protocol 2 [median DSC 71.25% (IQR 31.82%)] were in line with those [median DSC 64.00% (IQR 20%)] reported in Ref. [22] for seven semi-automatic approaches. The dataset in Ref. [22] was built with 15 subjects, for a total of 124 CMR-LGE training images (from 5 subjects) and 208 CMR-LGE testing images (from 10 subjects). A direct comparison was not possible as, to the best of the authors’ knowledge, the GT masks for the dataset presented in Ref. [22] were not provided for the testing images. Nonetheless, with respect to the state-of-the-art approaches, both Protocol 1 and Protocol 2 were fully automatic and parameter free (i.e., they did not require region-of-interest selection in addition to the delineation of the myocardial contours). This is an undoubted advantage compared to threshold approaches (such as nSD and FWHM), which require both user interaction for manual delineation of regions of interest in the LV myocardium for threshold computation and parameter tuning for setting the thresholds (e.g., the number of standard deviations for nSD). Similarly, clustering techniques such as GMM require the number of GMM classes to be defined, which is not always trivial [41, 42]. With respect to other DL-based methodologies, such as [18, 24, 25], our approach directly provided the segmentation mask without requiring (1) pre-processing to extract and (2) post-processing to merge superpixels or patches from the LV myocardial region. This was achieved by exploiting a fully convolutional architecture instead of an architecture based on CNNs with fully connected layers for classification tasks. Fully convolutional architectures are widely recognized in the literature to simplify and speed up both learning and segmentation while making them more robust [26].

A first limitation of this study, as also reported in Ref. [8], is related to the lack of a true GT for algorithm training and testing. In fact, obtaining a GT for the scar-segmentation task is not trivial. Histological validation in animal models, which has been considered as GT in other studies (e.g., [12, 43]), proved inadequate for humans [44]. However, expert tracing of the contours of the object of interest is widely considered an acceptable strategy to provide a reference for comparison. In this study, a possible limitation is that manual scar tracing was performed by one clinician, while a more robust gold standard could be achieved by determining a consensus on scar presence and morphology among several experts [45].

A second issue is related to the fact that our evaluation protocol was based on a limited number of CMR-LGE images. Even though this number was comparable with other state-of-the-art approaches [22] (332 CMR-LGE images from 15 different subjects), a larger training dataset would allow encoding more variability in image characteristics and in scar size and position, and could thus increase the segmentation performance. However, this initial work is intended as a proof of concept for the described methodology, and we are currently working with our clinical partners on expanding our training dataset.

As a future extension of this work, once a larger dataset in terms of number of patients becomes available, 3D FCNNs will be investigated to exploit the 3D information implicitly encoded in CMR-LGE data, which consist of a stack of 2D images. In fact, 3D FCNNs were recently shown to provide encouraging results when applied to magnetic resonance volumetric images of the prostate [46]. Thus, adding the third dimension to the analysis of CMR-LGE images (i.e., interpreting the image stack as a whole) could improve pixel classification.

In conclusion, the proposed strategy for scar segmentation from CMR-LGE images based on FCNN showed a good performance, in particular once the process was guided by limiting the processing area to the myocardium only, such as in Protocol 2. These results are promising for application of deep learning techniques to this kind of medical imaging, and constitute the basis for future research involving larger training datasets.