Introduction

On the assessment of chronic diffuse liver diseases, such as non-alcoholic fatty liver disease (NAFLD), non-alcoholic steatohepatitis (NASH), or iron overload, magnetic resonance (MR) imaging has an important role in the evaluation of parenchymal deposits, including fat and iron [1]. NAFLD and NASH may progress towards increasing stages of hepatic fibrosis and, finally, to cirrhosis and HCC development [2]. Proton density fat fraction (PDFF) and R2* relaxation rate are used as reproducible quantitative metrics, obtained from MR images, for steatosis and iron concentration estimations, respectively. MR imaging biomarkers take an important role in the diagnosis and treatment monitoring of patients suffering from chronic diffuse liver diseases [3,4,5,6], offering additional information to the qualitative diagnosis made by radiologists.

Liver biopsy has important limitations related to the invasiveness of the procedure, the sampling bias due to the heterogeneous distribution of different histological features within the liver parenchyma, the intra- and inter-observer variability of pathological grading system, and the large patients’ unacceptance and withdrawal [7,8,9]. On the other hand, the so-called virtual biopsies, carried out through MR images and computational methods, can be performed multiple times and evaluate the heterogeneous distributions with high patients’ acceptance.

Multi-echo chemical shift–encoded gradient-echo (MECSE) MR images allow for the simultaneous and precise quantification of fat and iron within the liver parenchyma [4, 5, 10]. PDFF and R2* measurements are highly related to the histopathological grading systems, allowing MR virtual biopsy to become a common procedure performed in clinical practice for the assessment of chronic diffuse liver diseases [4, 5].

Liver PDFF and R2* quantification is usually obtained from small regions of interest (ROI); however, there is a lack of standardization when placing the ROIs across the liver (different sizes, locations, and number), introducing subsampling biases and subjective region selection, increasing variability across sites and studies, and reducing the reproducibility and repeatability of the imaging biomarkers quantified [11, 12]. To obtain an estimation of the patient liver status and to grade the heterogeneity of fat or iron distribution, MR virtual biopsies should evaluate the imaging biomarkers across the whole liver in a voxel-wise approach. Liver segmentation is nowadays done in a manual or semi-automatic way, hindering radiologists’ workflow.

The main difficulties for automatic segmentation of the liver in these MR images are related to its similarities with the surrounding organs, aggravated by the low image spatial resolution and large slice thickness, which causes partial volume effects. Moreover, diseased livers have different shapes, morphologies, and sizes among patients.

Traditionally, model-based [13], atlas-based [14,15,16], and level set–based [17, 18] methodologies have been used to segment the liver on both MR and CT images. Unfortunately, these methods entail high computation costs and long execution times, being a challenge to generate an atlas model able to comprehend the huge variability in liver morphology within the population, failing in the generalization to all liver signal intensities and shapes.

Learning-based methodologies [19,20,21,22,23,24,25] have been proposed to automate some tasks traditionally done by radiologists [26]. Regarding organ segmentation, artificial intelligence (AI) and mainly convolutional neural networks (CNNs) are able to model all variations found on a training dataset and to perform an automatic segmentation in some seconds without high computational needs. Up to now, most of the developed learning-based CNN methods for liver segmentation are built using computed tomography (CT) images [19,20,21,22,23]. The ones developed over MR examinations [22,23,24,25] used different sequences including diffusion-weighted [24], T2-weighted [23, 24], and dynamic contrast-enhanced [23, 25] exams.

Our hypothesis is that MECSE-MR images, needed to precisely quantify liver fat and iron deposits, can be used for whole liver segmentation by using a CNN solution to automatically extract imaging biomarkers. The aim of this study is to develop and validate, with both internal and external independent cases, a novel CNN-based model trained to generate a whole liver virtual dissection and quantification.

Materials and methods

A retrospective multicenter and international collection of 182 patients with suspected diffuse liver disease and a 2D-MECSE-MR examination was conducted (Table 1), only patients without intrahepatic lesions were included. A first cohort of 153 studies, from two different centers, was selected to create and test the model. A second cohort included 29 studies from two independent centers in different countries for external validation purposes.

Table 1 Acquisition protocols of MR imaging studies used for both the model development and the external validation. A total of 4 MR scanners were used, one per center. The first column includes information from scanners the two scanners used at two different hospitals (Spain and Portugal) in the model development and test dataset; the second and third columns include the acquisition protocols used for the external validation with two other MR scanners at two distant sites (Argentina and Chile)

Dataset preparation

All the studies were manually segmented by a radiologist with more than 5 years’ experience on abdominal MR imaging. The first echo time was selected as the reference image for manual segmentation due to its higher signal-to-noise ratio [23]. The liver parenchyma was delineated, avoiding the gallbladder and large vessels.

The different splits of the whole dataset required for the development and validation of the model are illustrated in Fig. 1.

Fig. 1
figure 1

Datasets for the different training, testing, and validation of the model

Convolutional neural network architecture

The proposed architecture was an encoder-decoder CNN [27] with four convolutional blocks on each branch.

To increase the network generalization and reduce overfitting, a normalization of the activations was applied after each convolutional layer including a batch normalization layer [28].

Furthermore, to achieve an acceptable boundary detection and a faster convergence of the network, a deep supervision [29, 30] path was included. 1 × 1 convolutional layers were added at the output of each decoder level and summed to achieve the final activation map.

Preprocessing and data augmentation

All the images were resampled to a shape of 192 × 192 by applying a bicubic interpolation and normalized to (0–1) range.

To avoid overfitting and to increase the network generalization, on-the-fly data augmentation was applied during the training process. The transformations applied include rotations between ± 5° and Gaussian noise addition (μ = 0; σ є [0.001, 0.01]) to the training images.

Model training and validation/tuning

To analyze the generalization and robustness of the designed architecture to new unseen data and to tune some CNN hyperparameters, a 5-fold cross-validation strategy was followed.

The training was performed along 300 epochs with a batch size of 40 images. The loss function to optimize on each iteration was based on the Dice coefficient (DC) [31]. An ADAM optimizer [32] was used during the training process. The initial learning rate was set to 1e-5 and the remaining hyperparameters were kept with their default values [32].

PDFF and R2* quantification

Both PDFF and R2* were quantified voxel-wise in all MR studies using QUIBIM Precision – Liver fat and iron v1.0.0 analysis module (QUIBIM), approved as medical device with CE mark class IIa. The median liver value was used for each patient. Different ranges to differentiate steatosis and siderosis grades were defined following the thresholds introduced in [4]:

  • Steatosis (PDFF): none (< 4.8%), low (4.8–8.5%), mild (8.6–12.9%), high (> 12.9%).

  • Siderosis (R2*): none (< 42 s-1), mild (42–91 s-1), high (> 91 s-1).

To evaluate automatic segmentation performance, PDFF and R2* were quantified over both CNN-based and manual liver masks.

Heterogeneity assessment

The test dataset was used to analyze if the trained model was able to generalize to cases in which fat and/or iron were distributed heterogeneously. Four circular ROIs (7cm2) were drawn across the liver (two ROIs within the left lobe and two ROIs within the right lobe) avoiding non-liver parenchyma regions. The median value of each ROI was calculated and the difference between the lowest and highest values was extracted (ΔhDFF, ΔR2*).

Testing and performance evaluation

Seventy-six previously unseen cases were used to evaluate the network performance. After image normalization, all MR series were segmented with a 2D slice-by-slice approach using the trained network and stacked to finally obtain the whole liver mask. At the CNN output, a probability liver map was obtained. To increase robustness in the quantification, a more conservative segmentation was chosen ensuring that all the segmented voxels belong to liver parenchyma, without worrying if some peripheral voxels corresponding to the liver were missed. For that, all the voxels with a probability higher than 90% were defined as liver.

After segmentation, along the 3D mask, some small isolated segmented regions occasionally appeared. Therefore, all regions with a volume lower than the biggest component were removed, leaving a single large liver region.

Finally, to evaluate the performance and generalization of the trained network, different parameters were calculated to measure the differences between the CNN-based and the manual (ground truth) masks: dice coefficient (DC), volume overlap error (VOE), relative volume difference (RVD), average symmetric surface distance (ASSD), maximum surface distance (MSD), false discovery rate (FDR), Spearman correlation index (r) for liver volume, PDFF and R2*, and relative error in the quantification of PDFF and R2*.

Results

PDFF and R2* quantification

For each subset of data, the percentage of patients belonging to each steatosis and siderosis grade was analyzed using the median values obtained with the manual masks and the defined thresholds. Different PDFF and R2* distributions are found along the subsets (Fig. 2); therefore, the reproducibility of the model and its generalization to different steatosis and siderosis grades were analyzed.

Fig. 2
figure 2

PDFF and R2* distributions on the different subsets created from the whole dataset

Heterogeneity assessment

When analyzing PDFF and R2* on four different ROIs across the liver, a ΔPDFF of 4.01 ± 1.75% (mean ± SD) and a ΔR2* of 11.13 ± 7.45s-1 were obtained, being the maximum difference 8.57% for the PDFF and 38.56s-1 for the R2*. Whether the steatosis and/or siderosis grades are associated with each patient changed depending on the ROI value used was analyzed. Fifty-five of the cases suffered a steatosis grade change, while 33% had a change in the siderosis grade. Therefore, there are some cases in the test dataset where PDFF and/or R2* were distributed heterogeneously and the generalization to these cases was proved.

Model validation

The network robustness to different datasets was evaluated comparing the DC calculated over the training and tuning-validation datasets on each fold of the 5-fold cross-validation. Table 2 shows, on each fold, the highest mean DC obtained over the validation dataset and its corresponding value over the training dataset. The DC showed similar values along the different folds in both datasets, while maintaining similar results on each fold independently.

Table 2 Highest mean dice coefficient obtained on each fold from a cross-validation training over both training and internal validation images using the training-validation dataset

Model testing

The mean, standard deviation, and median values of the different metrics are summarized in Table 3. The median DC was 94% with a FDR of 4%, meaning that the number of non-liver segmented voxels was minimized while maintaining a good segmentation performance.

Table 3 Results obtained for the segmentation assessment over the testing dataset using a threshold of 0.9

Furthermore, a low relative error in both PDFF and R2* quantification was obtained, with mean ± SD values of 2.8 ± 3.25% and 0.59 ± 0.52% and median values of 1.4% and 0.47%, respectively.

PDFF and R2* correlations between the results obtained when analyzing the liver using the manual segmentation and the CNN mask are shown in Fig. 3 (upper row), showing a perfect correlation on both components (PDFF and R2*: r = 1, p < 0.001). The correlation between the liver volumes is also represented, obtaining a correlation value of r = 0.98 (p < 0.001).

Fig. 3
figure 3

Correlation analysis over the testing dataset for the liver volume (left), PDFF (middle), and R2* (right) quantification when comparing the results obtained when using the ground truth and the predicted segmentation mask. In the upper rows, the correlation between manual-based and automatic-based quantitative data is represented. In the bottom rows, Bland-Altman plots to compare both variables of each parameter are included. Within the correlation images, in x- and y-axes respectively, the histogram of the variables obtained with the ground truth and the CNN predicted mask is represented

Bland-Altman plots (Fig. 3, lower row) show the difference in the quantification of liver volume, PDFF, and R2*. These parameters show a low bias (volume: 0.07L, PDFF: 0.12% and R2*: 0.27s-1) and narrow limits of agreement (volume: [- 0.11 to 0.26] L, PDFF: [- 0.12 to 0.36]% and R2*: [- 0.18 to 0.71]s-1). Similar differences are obtained along the whole spectrum for each variable (no differences are seen between low and high volume, PDFF, and R2* values).

Figure 4 illustrates three different test studies. As seen, large vessels, such as the inferior vena cava, were excluded from the predicted segmentation mask.

Fig. 4
figure 4

Segmentations obtained on three different test cases at different anatomical levels. In yellow the CNN predicted segmentation mask is shown and in red the contour of the ground truth is represented. a Results in a study from one of the centers used for the model development. b Results in a study from the other center used for the model development. c Performance on a study which belongs to a patient with partial hepatectomy

External validation

The reproducibility of the trained model was evaluated using an additional MR dataset from different centers and scanners (Table 4). The results obtained with a 1.5-T scanner are lower than those obtained with a 3-T scanner (DC: 0.93 ± 0.03 Philips 3T, 0.87 ± 0.06 Philips 1.5T). However, although the model was trained with Philips MR studies, the results with the Siemens scanner were higher (DC: 0.93 ± 0.03 Philips, 0.94 ± 0.02 Siemens).

Table 4 Results obtained over the external validation dataset for the segmentation assessment differentiated by the two different international centers. Upper rows: institution 1 (Siemens 3T); bottom rows: institution 2 (Philips 1.5T)

In the assessment of PDFF and R2* quantification, a low relative error was seen. At institution 1, the mean ± SD relative error was 3.24 ± 3.93% for the PDFF and 0.80 ± 0.85% for the R2*, while the median values were 1.99% for the PDFF and 0.53% for the R2*. At institution 2, the mean ± SD values were 3.42 ± 3.99% for the PDFF and 0.86 ± 0.76% for the R2* and the median values were 1.96% for the PDFF and 0.73% for the R2*.

Figure 5 shows the correlation between the results obtained in the liver volume, PDFF, and R2* quantification, showing a nearly perfect correlation between the results obtained with both masks. At institution 1, the correlation between volume values was r = 0.96 (p < 0.001), while at institution 2, it was r = 0.88 (p < 0.001). At both institutions, both PDFF and R2* showed a perfect correlation (r = 1; p < 0.001). A Bland-Altman analysis was also conducted to analyze the differences in the quantification of the three parameters. For all the cases, the results showed a low bias value and narrow limits of agreement. In addition, similar values are obtained in the whole spectrum for each parameter.

Fig. 5
figure 5

Differences between the liver volume (left), PDFF (middle), and R2* (right) values obtained using the ground truth and the CNN predicted mask at institution 1 and institution 2. Within each institution, the upper row shows the correlation analysis between both values and the bottom row contains the Bland-Altman plots. In the correlation images, in x- and y-axes respectively, the histogram of the variables obtained with the ground truth and the CNN predicted mask is represented

The mask obtained on a study from each center is represented in Fig. 6.

Fig. 6
figure 6

Segmentation obtained on two studies from the external validation at different anatomical levels. In yellow the predicted segmentation mask is shown and in red the contour of the ground truth is represented. a Study acquired on a Siemens 3-T MR scanner. b Study acquired on a Philips 1.5-T MR scanner

Processing time

The overall processing time for the whole 3D liver CNN segmentation and quantification was 26 ± 6 s (mean ± SD).

Discussion

A novel methodology for an automatic vendor and field strength–neutral segmentation of the liver parenchyma on low-contrast MECSE-MR images based on CNN has been proposed. This automatic segmentation allows the quantification of fat (PDFF) and iron (R2*) in the assessment of diffuse liver diseases.

Different learning-based models have been reported for liver segmentation on a variety of MR series [22,23,24,25]. However, only one publication used a MECSE-MR sequence [22] with a DC of 0.93 ± 0.04, the same as us in our testing dataset. Although an external validation with different MR scanners was also performed, no results per scanner were provided. The mean DC reported in [22] for the external validation was 0.92 ± 0.05, while in our external validation, we achieved a mean DC of 0.87 ± 0.06 in a Philips 1.5-T scanner and 0.94 ± 0.02 in a Siemens 3-T scanner.

Previous studies focused on the maximization of the segmentation metrics, not focusing on using the mask to extract imaging biomarkers. The main objective of our method was achieving the highest accuracy in liver segmentation reducing false-positive areas to avoid errors and obtain the highest precision on PDFF and R2* quantification. Accordingly, the obtained liver mask has to be accurate but conservative, without including neighbor structures such as gallbladder or large vessels, even at the cost of losing small peripheral liver areas. The results obtained in this study show a high correlation and a quite low relative error when comparing the PDFF and R2* liver values quantified using the manual and the CNN-based mask.

A study by Stocker et al [33] shows the differences obtained when quantifying PDFF and R2* using an in-house automatic tool for liver segmentation and quantification. When comparing the Bland-Altman plots, narrower limits of agreement were obtained for both PDFF and R2* in our study, meaning that the differences between the results from the manual and the automatic tools were lower in our solution.

Noteworthily, the reproducibility of the developed model was analyzed using MECSE-MR studies from external centers with different scanners from the ones used during training-tunning. CNN-based models are known to offer different results when small changes are applied to the input images [34, 35], being very sensitive to training data properties. For this reason, we constructed a dataset covering a large population spectrum, from different centers and scanners, applying different preprocessing techniques such as data augmentation and image normalization to improve the reproducibility of the model.

The lower precision obtained with the 1.5-T MR scanner can be related to the fact that these opposed-phase images were acquired with a lower in-plane resolution (2.3 × 2.3 mm2) and a higher slice thickness (10-10.5 mm), aggravating partial volume effects and, therefore, making difficult tissue differentiation. To increase the model generalization to these acquisitions, the CNN model should be further tuned and retrained with images from scanners with lower field strength magnets and lower image resolution. Another option would be the application of recently published solutions to perform a manufacturer shift using different CNN-based style transfer applications based on generative adversarial networks (GANs) [36]. However, when comparing the PDFF and R2* values quantified from the manual ground truth and the predicted mask, the high correlation between values and the low relative error in the quantification foster its applicability in clinical practice. Additionally, the CNN solution showed generalization to other manufacturers; the results obtained with Siemens images were better than those obtained with Philips scanner. It is also relevant to note that the training and internal validation processes were conducted using images from Spain and Portugal (Europe), while the external validation faced studies from Argentina and Chile (South America), fostering the international generalization of the tool.

In this way, the automatic quantification of PDFF and R2* allows the generation of virtual biopsies, facing the main problems that pathological biopsies entail, offering a huge advance on the assessment of diffuse liver diseases. Moreover, any error obtained in the CNN-based segmentation can be manually edited by an expert before performing the quantitative analysis over the new corrected mask, ensuring the validity of the results.

It is well-known that other 3D-MR sequences are commonly used. Since the proposed algorithm is based on a 2D CNN, the MR images are segmented slice-by-slice. Therefore, since the CNN has learned to segment independent slices from different anatomical levels, the expansion of this solution to 3D acquisitions is direct.

The main limitation of this study relates to the limited sample size, with a lack of studies coming from some other manufacturers and from scanners with different MECSE-MR acquisition protocols. Other next steps include liver parcellation in left-right lobes, and in the different hepatic segments. A complete liver volumetry together with a separate PDFF and R2* quantification will improve the heterogeneity evaluation of fat and iron distribution. Furthermore, the segmentation of the lesions will be of interest to exclude them from the liver parenchyma quantification.

In conclusion, whole liver parenchyma can be automatically and accurately segmented using CNN before imaging biomarker extraction. Deep learning liver virtual dissection allows the creation of automatic pipelines to characterize diffuse liver diseases through the quantification of fat fraction and iron concentration by calculating the PDFF and R2* in a voxel-wise manner. This model allows MR virtual biopsy to become a fast and automatic procedure for the assessment of chronic diffuse liver diseases in clinical practice.