Liver fibrosis (LF) is the most frequent consequence of all chronic liver diseases (CLDs) [1]. It is characterized by the excessive accumulation of extracellular matrix (ECM) [2], leading to the replacement of injured tissue by collagenous scar and the consequent distortion of liver architecture. The major clinical consequences of cirrhosis are impaired liver function, portal hypertension, and the development of hepatocellular carcinoma (HCC) [3]. End-stage LF is often considered irreversible, with few effective treatments other than liver transplantation, whereas early or intermediate hepatic fibrosis is usually a treatable complication [4, 5]. Therefore, early detection and staging of LF are crucial for therapeutic decision-making and monitoring treatment responses.

Currently, biopsy is considered the gold standard for assessing LF [6]. However, it is invasive and prone to sampling variability [7]. Therefore, noninvasive methods for evaluating LF have become a widely discussed topic worldwide [8–10]. Diffusion-weighted imaging (DWI) is a functional magnetic resonance imaging (MRI) technique based on the principles of Brownian motion (random thermal diffusion) of small molecules in a tissue [11]. As a notable DWI-based imaging technique, intravoxel incoherent motion (IVIM) analyzes the signal decay over multiple b values to simultaneously evaluate the perfusion-related diffusivity (represented by the parameters D* and f) and pure molecular diffusivity (represented by the parameter D) [12]. An increasing number of studies have focused on the diagnostic performance of DWI for the staging of LF [13, 14], although discrepant results have been reported among those studies.

A previous study [15] compared the diagnostic accuracies of magnetic resonance elastography (MRE) and DWI for the assessment of LF, and concluded that MRE is more reliable for LF staging. Although the performance of DWI was limited in that comparison, MRE is currently available only in selected centers, whereas DWI is widely available and easy to perform. Moreover, more studies and patients are included in the present meta-analysis, and recent advances in DWI techniques have shown progress in LF staging. This study aims to evaluate the diagnostic performance of DWI in LF staging and explore factors that may influence the diagnostic accuracy.

Materials and methods

Literature search and screening

A systematic literature search was performed by two investigators independently in MEDLINE, Web of Science, EMBASE, Springer Link, and Science Direct to identify relevant articles published before February 2016 with the keywords “liver/hepatic fibrosis or cirrhosis” and “diffusion magnetic resonance imaging or diffusion-weighted imaging or DWI or apparent diffusion coefficient or intravoxel incoherent motion-DWI.” The search was limited to articles concerning humans with an abstract in English.

Two reviewers read the titles and abstracts of the retrieved articles that addressed the diagnostic performance of DWI for staging LF in humans to select potentially relevant articles. The full set of selected articles was collected and reviewed independently by the same reviewers to determine their eligibility for further quantitative analysis. The inclusion criteria were as follows: (1) DWI was performed to identify LF; (2) sufficient data were available to calculate true-positive (TP), false-positive (FP), false-negative (FN), and true-negative (TN) values; (3) histopathology (METAVIR score) was used as the reference standard; and (4) the study population comprised at least 20 patients. The exclusion criteria were as follows: (1) duplicate publications based on the same primary study; (2) articles of poor quality; (3) studies focused on children; and (4) nonoriginal research including review articles, abstracts, letters, comments, guidelines, and case reports. Investigators were not blinded to the information about the authors, the authors’ affiliation, or the journal name. Disagreements between the two reviewers were resolved by consensus. Investigators of the primary studies were approached for additional information, if necessary.

Data extraction and quality assessment

Data were extracted independently by the same two reviewers mentioned above. A senior radiologist with more than 20 years of experience in hepatic disease diagnosis was consulted to resolve discrepancies between the two reviewers.

To extract data concerning study characteristics, we recorded the patient information (study population, number of male and female patients, mean patient age with range, and patient spectrum), study design (prospective or retrospective), scoring system for histopathologic staging of LF, blinding procedure, reference standard (i.e., liver biopsy and/or surgery), and time interval between index test and reference standard. We also recorded the imaging protocols (magnetic field strength, b values, and MR scanner) adopted in the primary studies to perform DWI. For the calculation of the diagnostic accuracy of DWI, we extracted available data on TPs, FNs, FPs, and TNs. We grouped accuracy results according to five fibrosis stages: F0 = no fibrosis; F1 = portal fibrosis without septa; F2 = portal fibrosis and few septa; F3 = numerous septa without cirrhosis; and F4 = cirrhosis [16]. The 2 × 2 contingency tables were constructed for F0 vs. F1–F4 (≥F1), F0 and F1 vs. F2–F4 (≥F2), F0–F2 vs. F3 and F4 (≥F3), and F0–F3 vs. F4, respectively. The quality of the included studies was assessed according to quality assessment of diagnostic accuracy studies-2 (QUADAS-2) [17].
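The dichotomization described above can be illustrated with a minimal Python sketch. It assumes per-patient METAVIR stages and binary DWI calls are available, which is a simplification: the meta-analysis itself extracted TP, FP, FN, and TN counts from the published reports. The function names and the example data are hypothetical.

```python
def two_by_two(stages, test_positive, threshold):
    """Count (TP, FP, FN, TN) for the target condition 'stage >= threshold'.

    stages: METAVIR stages (0-4) from the reference standard.
    test_positive: DWI calls (True = fibrosis at or above the threshold).
    """
    tp = fp = fn = tn = 0
    for stage, positive in zip(stages, test_positive):
        diseased = stage >= threshold
        if diseased and positive:
            tp += 1
        elif diseased:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy_measures(tp, fp, fn, tn):
    """Per-study accuracy measures; PLR is undefined when specificity is 1."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "PLR": sens / (1 - spec),
        "NLR": (1 - sens) / spec,
    }


# Hypothetical data for eight patients (not taken from any included study).
stages = [0, 1, 2, 3, 4, 4, 2, 0]
calls = [False, False, True, True, True, False, False, True]
for threshold in (1, 2, 3, 4):
    print(f">=F{threshold}", two_by_two(stages, calls, threshold))
```

In practice a different ADC cutoff would be applied at each threshold, so the binary calls would differ between the four tables; the single list of calls is used here only to keep the sketch short.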

Statistical analysis

We first used a random-effects coefficient binary regression model to summarize the diagnostic performance. The summary receiver operating characteristic (SROC) curve was constructed, and the area under the SROC curve (AUC) for each LF stage served as the measure of the diagnostic performance of DWI [18].

Heterogeneity between the included studies was evaluated. Several potential sources may contribute to the heterogeneity, the first of which is the threshold effect. We considered a threshold effect to be absent when no “shoulder-arm” shape was observed in the SROC plane [19].

Apart from variations due to threshold effect, heterogeneity could be generated from other related factors. The heterogeneity was identified by the Q statistic of the χ² test and the inconsistency index (I²), and p < 0.1 or I² > 50% indicated the presence of heterogeneity [20]. If significant heterogeneity was detected, we then performed single-factor meta-regression analyses to determine factors that contributed to the heterogeneity and subgroup analyses to observe their quantitative effects on the diagnostic results [21]. The subgroup analyses of different LF stages included comparisons of (1) study design (prospective vs. retrospective); (2) blinding procedure (yes vs. unclear); (3) study population; (4) mean patient age; (5) patient gender (male vs. female); (6) reference standard (liver biopsy vs. surgery); (7) MR scanner; (8) MR field strength (1.5 T vs. 3.0 T); (9) number of b values; (10) maximal b value (b max ≥ 800 s/mm2 vs. b max < 800 s/mm2); and (11) MR modality (conventional DWI vs. IVIM).
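For illustration, the Q statistic and I² can be computed from per-study effect estimates and their variances as in the following Python sketch. The choice of effect measure (for example, the log diagnostic odds ratio) and the function name are assumptions made here; the text does not specify which software or effect measure was used for this step.

```python
def cochran_q_i2(effects, variances):
    """Return (Q, I2 in percent) for per-study effects with known variances.

    Uses inverse-variance weights around the fixed-effect pooled estimate.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I2 is truncated at zero when Q does not exceed its degrees of freedom.
    i2 = 100.0 * (q - df) / q if q > df else 0.0
    return q, i2


# Hypothetical log diagnostic odds ratios and variances for four studies.
q, i2 = cochran_q_i2([2.1, 3.0, 1.2, 2.6], [0.20, 0.35, 0.15, 0.40])
print(f"Q = {q:.2f}, I2 = {i2:.1f}%")
```

The p value for Q would come from a χ² distribution with k − 1 degrees of freedom, where k is the number of studies; that step is omitted so the sketch needs only the standard library.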

Publication bias was assessed with Deeks’ funnel plots and Egger’s asymmetry tests [22]. A symmetrical inverted funnel plot with p > 0.05 was considered to indicate the absence of publication bias [23].
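As a rough sketch of the funnel-plot approach, the following Python code computes the coordinates commonly used for a Deeks’ plot (the log diagnostic odds ratio against the inverse square root of the effective sample size) and the slope of the regression weighted by effective sample size. The 0.5 continuity correction applied to every cell and the function names are assumptions made here; the p value of the slope, which is what the asymmetry test reports, is not computed.

```python
import math


def deeks_points(tables):
    """Return (x, y, w) per study: x = 1/sqrt(ESS), y = ln(DOR), w = ESS."""
    points = []
    for tp, fp, fn, tn in tables:
        # Assumption: add 0.5 to every cell so ln(DOR) is always finite.
        tp, fp, fn, tn = (v + 0.5 for v in (tp, fp, fn, tn))
        n_diseased, n_healthy = tp + fn, fp + tn
        ess = 4 * n_diseased * n_healthy / (n_diseased + n_healthy)
        ln_dor = math.log((tp * tn) / (fp * fn))
        points.append((1 / math.sqrt(ess), ln_dor, ess))
    return points


def weighted_slope(points):
    """Slope of the weighted least-squares line of y on x."""
    total_w = sum(w for _, _, w in points)
    mean_x = sum(w * x for x, _, w in points) / total_w
    mean_y = sum(w * y for _, y, w in points) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in points)
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in points)
    return sxy / sxx


# Hypothetical (TP, FP, FN, TN) counts for four studies.
tables = [(40, 10, 10, 40), (18, 6, 7, 25), (55, 9, 14, 60), (12, 5, 4, 15)]
print(weighted_slope(deeks_points(tables)))
```

A slope close to zero corresponds to a symmetrical plot; a formal test would compare the slope with its standard error.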

Results

Study selection and quality assessment

The systematic search initially yielded 301 results, of which 25 studies were included in this meta-analysis. All studies were published between October 2007 [24] and February 2016 [25]. The study flowchart is shown in Fig. 1. The overall quality of the included studies was good. Quality assessment for the included diagnostic studies is presented in Table 1. Figure 2 shows a graphical display of the QUADAS-2 results regarding the proportion of studies with low, high, or unclear risk of bias.

Fig. 1 Study flowchart

Table 1 Quality assessment of the 25 included diagnostic studies

Fig. 2 Graphical display for QUADAS-2 results regarding proportion of studies with low, high, or unclear risk of bias

Study characteristics

The 25 included studies [24–48] involved 1833 patients, whose ages ranged from 16 to 89 years. Fourteen studies [24, 26–28, 30–34, 38, 41, 44, 45, 47] were prospective, and the remaining 11 [25, 29, 35–37, 39, 40, 42, 43, 46, 48] were retrospective. A blinding procedure was reported in 15 studies [24, 25, 28–30, 34–39, 43, 44, 46, 47] and was unclear in the remaining 10 [26, 27, 31–33, 40–42, 45, 48]. The disease spectrum was restricted to chronic hepatitis in six trials [27, 29–33] and to type 2 diabetic patients in one trial [28]; in the remaining 18 trials, there was no restriction (Table 2). The imaging acquisition parameters are listed in Table 3.

Table 2 Study and patient characteristics of included studies
Table 3 DWI imaging protocols

For study-level analyses, 12 studies [25, 26, 28, 29, 33–35, 38, 41–44] met the inclusion criteria for fibrosis stage ≥F1, 16 studies [24–27, 29, 31, 34, 35, 37, 40, 42–44, 47, 48] for ≥F2, 18 studies [24–27, 29–31, 34–39, 42–44, 47, 48] for ≥F3, and 12 studies [25, 29, 32, 34, 35, 37, 40, 42–45, 48] for F4. All patients had biopsy or surgery results as reference standards. Diagnostic results of each subset are presented in Table 4.

Table 4 Diagnostic results of different LF stages

Diagnostic performance

Pooled sensitivities with corresponding 95% confidence intervals (CIs) for LF ≥F1, ≥F2, ≥F3, and F4 were 0.78 (95% CI 0.75–0.82), 0.81 (95% CI 0.78–0.84), 0.71 (95% CI 0.67–0.75), and 0.80 (95% CI 0.75–0.85), respectively. Pooled specificities for LF ≥F1, ≥F2, ≥F3, and F4 were 0.78 (95% CI 0.73–0.82), 0.80 (95% CI 0.76–0.83), 0.84 (95% CI 0.81–0.86), and 0.77 (95% CI 0.74–0.81), respectively. According to the SROC curve, the AUCs of LF ≥F1, ≥F2, ≥F3, and F4 were 0.8554, 0.8770, 0.8836, and 0.8596, respectively. Forest plots of sensitivity, specificity, positive likelihood ratio (PLR), and negative likelihood ratio (NLR) of different subgroups are shown in Table 4. The SROC curves are shown in Fig. 3.

Fig. 3 Summary receiver operating characteristic (SROC) curve for DWI in staging LF. The AUCs of LF ≥F1 (A), ≥F2 (B), ≥F3 (C), and F4 (D) were 0.8554, 0.8770, 0.8836, and 0.8596, respectively, indicating a good but not excellent diagnostic accuracy

Heterogeneity assessment and meta-regression analysis

Highly significant heterogeneity was detected in this meta-analysis. Threshold effects were ruled out for all fibrosis stages because the SROC planes showed no “shoulder-arm” shapes. The single-factor meta-regression analyses showed that for LF ≥F1, no factor contributed statistically significantly to heterogeneity; for both LF ≥F2 and ≥F3, the maximal b value was the most important source of heterogeneity; and magnetic field strength and the MR imaging protocol contributed most to the heterogeneity of F4. Study design, patient age, patient gender, reference standard, number of b values, and blinding procedure did not contribute statistically significantly to the heterogeneity in any fibrosis stage.

Subgroup analyses

We performed subgroup analyses between different study characteristics in each fibrosis stage to evaluate their quantitative effects on heterogeneity. The sensitivity, specificity, and AUC of b max < 800 s/mm2 for LF ≥F2 were 0.75, 0.71, and 0.7994, respectively, and those of b max ≥ 800 s/mm2 were 0.85, 0.86, and 0.9183, respectively; the difference was statistically significant. The sensitivity, specificity, and AUC of b max < 800 s/mm2 for LF ≥F3 were 0.59, 0.83, and 0.8360, respectively, and those of b max ≥ 800 s/mm2 were 0.82, 0.84, and 0.9162, respectively, again a statistically significant difference. For LF = F4, statistically significant differences were detected between the diagnostic accuracies of different magnetic field strengths (p = 0.0354) and MR modalities (p = 0.0335). The results of the subgroup analyses are presented in Table 5.

Table 5 Integrant results of the subgroup analyses

Publication biases

The funnel plot shows that studies were distributed symmetrically on a scatter plot. The p values of the Deeks’ funnel plot asymmetry test for LF ≥F1, ≥F2, ≥F3, and F4 were 0.35, 0.37, 0.82, and 0.29, respectively, which demonstrated no evidence of notable publication bias (Fig. 4).

Fig. 4 Funnel plots for DWI in staging LF. The p values of the Deeks’ funnel plot asymmetry test for LF ≥F1 (A), ≥F2 (B), ≥F3 (C), and F4 (D) were 0.35, 0.37, 0.82, and 0.29, respectively, demonstrating no evidence of notable publication bias

Discussion

DWI is a quick and repeatable noninvasive MR modality that enables qualitative and quantitative evaluation of tissue diffusivity without the use of gadolinium chelates. In LF staging, the apparent diffusion coefficient (ADC) of fibrotic hepatic tissue is usually significantly lower than that of normal liver tissue, and the ADC values decrease as the fibrosis score increases [49, 50]. A possible explanation for this phenomenon is that in fibrotic liver tissue, the increased amount of proton-poor connective tissue restricts molecular diffusion and blood flow [11, 12, 51], leading to a decreased ADC. Previous studies have shed light on the feasibility of DWI in the staging of LF, the monitoring of treatment responses, and the follow-up of patients with LF [24–48].

In this meta-analysis, we first explored the ability of DWI in LF staging. A diagnostic tool is defined as perfect if the AUC is 100%, excellent if the AUC is greater than 90%, and good if the AUC is greater than 80% [52]. According to this, the results showed that DWI had good but not excellent diagnostic accuracy for LF staging.
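The grading rule quoted above can be written as a small Python function and applied to the four AUCs reported in the Results. The function name and the label used for AUCs of 0.80 or lower are hypothetical; the text defines only the three upper categories.

```python
def grade_auc(auc):
    """Grade a diagnostic tool by AUC using the thresholds quoted in the text."""
    if auc >= 1.0:
        return "perfect"
    if auc > 0.90:
        return "excellent"
    if auc > 0.80:
        return "good"
    return "not graded"  # hypothetical label; the text defines no lower category


# AUCs reported in the Results for each fibrosis threshold.
reported = {">=F1": 0.8554, ">=F2": 0.8770, ">=F3": 0.8836, "F4": 0.8596}
for stage, auc in reported.items():
    print(stage, grade_auc(auc))
```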

To our knowledge, no standardized DWI technique exists as yet, and imaging parameters vary widely in the number and magnitude of b values and in the diagnostic thresholds for different tissues, organs, and diseases. At least two b values are required for the calculation of ADC when performing DWI. Although several b values are often used in practice to perform a linear regression analysis, which lowers perfusion contamination and regional ADC variation and thus makes the ADC estimate more robust [53], our meta-regression analyses revealed that the number of b values did not correlate statistically significantly with the diagnostic performance of DWI in the staging of LF.
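The two ways of estimating ADC mentioned above can be sketched in Python under the usual mono-exponential assumption, S(b) = S0 · exp(−b · ADC). The function names and the synthetic signal values are illustrative assumptions and do not reproduce the protocol of any included study.

```python
import math


def adc_two_point(s_low, s_high, b_low, b_high):
    """ADC from two b values: ln(S_low / S_high) / (b_high - b_low)."""
    return math.log(s_low / s_high) / (b_high - b_low)


def adc_linear_fit(b_values, signals):
    """ADC as the negated least-squares slope of ln(S) against b."""
    n = len(b_values)
    log_s = [math.log(s) for s in signals]
    mean_b = sum(b_values) / n
    mean_y = sum(log_s) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(b_values, log_s))
    return -sxy / sxx


# Synthetic noise-free signal with ADC = 1.2e-3 mm2/s (illustrative value).
b_values = [0, 200, 400, 800]
signals = [1000 * math.exp(-b * 1.2e-3) for b in b_values]
print(adc_two_point(signals[0], signals[-1], b_values[0], b_values[-1]))
print(adc_linear_fit(b_values, signals))
```

With noise-free mono-exponential data both estimates coincide; with real data the regression over several b values averages out noise at individual b values.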

Typical b values for LF imaging vary from 0 to 1000 s/mm2. In subgroup analysis, our study revealed that for fibrosis stages ≥F2 and ≥F3, b max ≥ 800 s/mm2 performed statistically significantly better than b max < 800 s/mm2 in the staging of LF, indicating that adopting a b max between 800 and 1000 s/mm2 could significantly improve the diagnostic accuracy of DWI for significant and severe fibrosis (F2 and greater). In clinical practice, high accuracy in the detection of ≥F2 and ≥F3 is essential. Owing to cost, risk of toxicity, and limited efficacy, Kim et al. [54] suggested that only hepatitis C patients with LF ≥F2 should receive antiviral treatment. Moreover, significant fibrosis (F2) is usually considered a hallmark of progressive disease, and the major treatment for this fibrosis stage is resolving the underlying cause of liver disease [15, 55]. In addition, discrimination of advanced fibrosis (F3) or cirrhosis (F4) is essential because those patients should be screened for portal hypertension and HCC [55].

This finding was in accordance with previous studies. Ozkurt et al. [56] used different b values including 250, 500, 750, and 1000 s/mm2 in their study and found that the negative correlation between the fibrosis score and ADC values was significant only at b values of 750 and 1000 s/mm2. Taouli et al. [24] reported that the ADC value was significantly correlated with the LF stage at b values of at least 500 s/mm2, and that the ADC value with the strongest correlation with fibrosis stage was acquired from a combination of b values of 0–1000 s/mm2. Other studies that involved low b values (e.g., 0–128 s/mm2 and 50–400 s/mm2) reported no significant correlation between the ADC values and hepatic fibrosis stage [30, 38, 46, 53]. A possible explanation is that with low b values, blood flow contributes more to the signal attenuation. Therefore, a relatively small b max could increase the amount of perfusion contamination in ADC measurement [11]. However, we did not identify statistically significant differences between the diagnostic accuracies of b max ≥ 800 s/mm2 and b max < 800 s/mm2 in other LF stages. This may be because fibrosis is not the only source of altered diffusion properties and ADC values in cirrhotic liver. Previous studies have reported that an increased degree of hepatic inflammation [26, 29, 30, 32, 43] and liver fat content [57] may lead to reduced ADC as well.

The subgroup analyses also revealed that for patients with cirrhosis (F4), IVIM performed statistically significantly better than the conventional DWI model (Table 5). According to equations defined by Le Bihan et al. [58], f is the fraction of microcirculation (perfusion)-related diffusion, D is the diffusion parameter of pure molecular diffusion (slow component of diffusion), and D* is that of perfusion-related diffusion (fast component of diffusion). Liver diffusion combines both pure molecular diffusion and capillary perfusion [51], and with IVIM, pure molecular diffusion can be separated from perfusion-related diffusion with the use of a wide range of b values (including low [<200 s/mm2] and high [≥200 s/mm2]). Prior studies [28, 34, 35, 39–41, 45] have shown that IVIM-derived D* was significantly lower in fibrotic liver tissue than in nonfibrotic liver tissue. Moreover, Luciani et al. [12] applied the IVIM model and reported that ADC changes observed in liver cirrhosis were more reflective of a decrease in capillary perfusion than in pure molecular diffusion. However, we failed to identify a statistically significant difference between D* and the ADC value in other fibrosis stages; this could be on account of the relatively poor reproducibility of D* and the limited number of studies and patients included in our meta-analysis. Therefore, further studies of IVIM with better measurement precision and larger study cohorts are necessary to further demonstrate the diagnostic performance of IVIM in the staging of LF.
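To make the roles of f, D, and D* concrete, the following Python sketch evaluates a commonly used simplified bi-exponential IVIM signal model, S(b)/S0 = f · exp(−b · D*) + (1 − f) · exp(−b · D), and shows how the mono-exponential ADC computed from b = 0 and a single maximal b value depends on that maximal b value. The simplified form of the model, the function names, and the parameter values are assumptions chosen for illustration; some formulations use D + D* in the perfusion term.

```python
import math


def ivim_signal(b, f, d, d_star):
    """Normalized signal S(b)/S0 for the simplified bi-exponential IVIM model."""
    return f * math.exp(-b * d_star) + (1.0 - f) * math.exp(-b * d)


def apparent_adc(b_max, f, d, d_star):
    """Mono-exponential ADC computed from b = 0 and b = b_max only."""
    return -math.log(ivim_signal(b_max, f, d, d_star)) / b_max


# Illustrative parameters only (not taken from the included studies).
F, D, D_STAR = 0.25, 1.0e-3, 50.0e-3  # D and D* in mm2/s

for b_max in (100, 400, 800, 1000):
    adc = apparent_adc(b_max, F, D, D_STAR)
    print(f"b max = {b_max:4d} s/mm2  ADC = {adc * 1e3:.2f} x 10^-3 mm2/s")
```

Under these assumed parameters the apparent ADC moves toward D as b max increases, which is consistent with the perfusion contamination at low b values discussed above.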

In subgroup analyses, we also compared the effects of two magnetic field strengths: 1.5 and 3.0 T. Our study showed that in the liver cirrhosis (F4) group, high field strength (3.0 T) demonstrated statistically significantly higher sensitivity, specificity, and AUC than low field strength (1.5 T) (Table 5). High-field imaging provides a higher signal-to-noise ratio (SNR) [59–61], which can be used to increase either spatial resolution or SNR in the ADC maps. However, we failed to identify a statistically significant difference between the diagnostic accuracies of 1.5 T and 3.0 T in other fibrosis stages. One possible explanation is that at high field strength, echo-planar imaging results in increased susceptibility artifacts; non-echo-planar imaging sequences may therefore mitigate the degraded image quality and improve the diagnostic performance at higher field [11]. Therefore, despite the increased availability of 3.0 T imagers, data on the use of 3.0 T DWI for assessing LF are still limited, and improved acquisition techniques are required.

However, there are several challenges in using DWI to assess LF. First, the measured ADC depends on several imaging parameters including field strength, repetition time, echo time, and b values; as a result, the ADCs reported in previous studies are variable, with considerable overlap between normal and abnormal ranges [13, 24, 30, 46, 47]. Second, images with sufficient quality for reliable quantitative analysis are hard to obtain because DWI is sensitive to susceptibility and motion-related artifacts [13]. Third, a number of potential confounding factors including hepatic perfusion effects, steatosis, edema, hepatic iron, and hepatic necroinflammatory alterations can influence the accurate interpretation of ADC values. Therefore, further studies are required to validate the diagnostic performance of DWI in the assessment of LF and to develop standardized DWI methods across different imaging centers.

Our study has limitations. First, the number of studies with high field strength or IVIM is limited, and most of these studies overlapped, so it is hard to tell which factor contributed principally to the improved diagnostic performance. Second, although QUADAS-2 was adopted to assess the quality of the included studies, many retrospectively designed or unblinded studies were still included. Therefore, to take full advantage of the benefits of high field strength and new protocols, prospectively designed large-scale studies specifically addressing those factors are needed in the future.

In conclusion, through a larger sample size, comprehensive statistical analysis, and the inclusion of high-field scanners and modified IVIM protocols, this study revealed the currently good diagnostic performance of DWI for LF staging, indicated the value of high b value diffusion imaging, and presented the potential role of high field strength and IVIM for future fibrotic liver imaging.