Introduction

Gadoxetic acid-enhanced magnetic resonance imaging (MRI) has been widely used in patients with chronic liver disease and liver cirrhosis [1]. Gadoxetic acid is taken up by organic anion transporters into normal hepatocytes during the transitional and hepatobiliary phases, a unique characteristic that enables the detection of focal hepatic lesions and the assessment of hepatic function and chronic liver disease severity [2, 3]. Previous studies have introduced several methods to evaluate hepatobiliary phase uptake of gadoxetic acid as a noninvasive surrogate parameter for hepatic function. These studies focus on parameters such as relative liver enhancement, hepatic uptake index, contrast uptake index, liver-to-spleen contrast index, and T1 values [2, 4]. Although these methods yield quantitative results, they are time-consuming and depend on the vendor, magnetic field strength, and imaging sequence, making clinical application difficult.

Bastati et al. developed the functional liver imaging score (FLIS), a scoring system to evaluate liver function based on qualitative MRI features [5]. FLIS is the sum of three simple visual features evaluated on the hepatobiliary phase of gadoxetic acid-enhanced MRI: enhancement quality score (EnQS), excretion quality score (ExQS), and portal vein sign quality score (PVsQS) [5]. This semi-quantitative scoring system makes it easier to evaluate hepatic function than quantitative methods because it does not require measuring signal intensity, calculating complex equations, or using dedicated software [6, 7]. FLIS was associated with the probability of graft survival in liver transplant recipients and with first hepatic decompensation and mortality in patients with advanced chronic liver disease [5, 8]. In addition, a newly suggested algorithm based on a combination of FLIS and splenic diameter measured using MRI could stratify the risk of mortality in patients with advanced chronic liver disease [9]. Taken together, these results indicate that FLIS is a promising imaging biomarker.

High reproducibility is essential for reliable imaging biomarkers [10]. Previous studies have reported almost perfect inter-reader reliability of FLIS [5, 8, 9, 11,12,13,14]. However, the inter-reader reliability of FLIS was an ancillary finding in each study, and it has not been systematically determined. Moreover, although FLIS is not expected to be affected by reader- or MRI-related factors, this expectation has not been formally tested. Therefore, the purpose of this study was to systematically determine the inter-reader reliability of FLIS and explore possible factors that affect it.

Materials and methods

This study was conducted and reported following the guidelines for Meta-analysis of Observational Studies in Epidemiology [15] and Preferred Reporting Items for Systematic Reviews and Meta-Analyses [16, 17].

Literature search

Original research articles reporting the inter-reader reliability of FLIS derived from gadoxetic acid-enhanced MRI were systematically searched in the MEDLINE and EMBASE databases. The representative terms used for the sensitive literature search were “functional liver imaging score,” “gadoxetic acid,” and “MRI,” and the detailed search query is presented in Supplementary Table 1. The literature search was limited to original studies on human subjects that were published in English. The search period began with studies published in January 2013 and was updated until June 2022. The bibliographies of the identified studies were reviewed to include additional eligible studies.

Eligibility criteria

Studies were included if the following criteria were met: (a) Population: patients who underwent gadoxetic acid-enhanced MRI for the evaluation of the hepatobiliary system or liver graft; (b) Index test: gadoxetic acid-enhanced MRI that included 20-min delayed hepatobiliary phase imaging; (c) Comparator: no requirements; (d) Outcome: inter-reader reliability of FLIS; and (e) Study design: any type of study including observational studies and clinical trials. Studies were excluded if they met the following criteria: (a) review articles, conference abstracts, letters, and editorials; (b) studies in which patient cohorts and data overlapped; (c) studies unrelated to the field of interest of this study; and (d) studies that did not provide sufficient data to determine inter-reader reliability. The titles and abstracts of potentially eligible studies were reviewed based on eligibility criteria before conducting full-text reviews of the remaining studies.

Data extraction

The following data were extracted from the included studies using a predefined form: (a) study characteristics: author, year of publication, study design, study type, subject enrollment method, and country in which the study was performed; (b) demographic and clinical data: number of patients, patient age, and underlying hepatobiliary disease; (c) MRI data: vendor, type of scanner, magnetic field strength (1.5-T or 3.0-T), and type of contrast agent; (d) image interpretation data: number of readers, reader experience, and clarity of blinding to the reference standard during the review; and (e) study outcomes: inter-reader reliability of FLIS and its three subcategories (EnQS, ExQS, and PVsQS). The intraclass correlation coefficient (ICC) or kappa value (κ) with its standard error was extracted to calculate the meta-analytic estimate of inter-reader reliability. Two reviewers independently performed data extraction, and disagreements were resolved at a consensus meeting.

Quality evaluation

The quality of the eligible studies was evaluated according to the Guidelines for Reporting Reliability and Agreement Studies [18]. Risk of bias in the following seven domains was evaluated: (a) index test; (b) study subjects; (c) readers; (d) reading process; (e) clarity of blinding during the review; (f) statistical analysis; and (g) the actual number of subjects. Details of the questionnaires for each domain are described in Supplementary Table 2. Each domain was rated as high-quality when the study detailed measures to limit potential bias. Two reviewers independently performed the study quality evaluation, and disagreements were resolved at a consensus meeting.

Data synthesis and statistical analysis

R version 4.2.1, with the meta and metafor packages, was used to perform the analyses. The ICC or κ with its standard error was summarized for FLIS and its three subcategories (EnQS, ExQS, and PVsQS) from each study to calculate meta-analytic pooled estimates. When an original study did not report a standard error, it was estimated from the 95% confidence interval (CI). If a study reported the inter-reader reliability of the FLIS subcategories only, without reporting that of FLIS itself, the median of the reported subcategory values was taken as the inter-reader reliability of FLIS. Meta-analytic pooled estimates with 95% CI were calculated using the DerSimonian-Laird random-effects model with or without the Knapp and Hartung adjustment [19]. The meta-analytic pooled estimates were categorized based on Landis and Koch as follows: ≤ 0.20, poor; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; and 0.81–1.00, almost perfect reliability [20]. Heterogeneity was evaluated using the Cochran Q-test and the I² statistic as follows: < 25%, low heterogeneity; 25–75%, moderate heterogeneity; and > 75%, high heterogeneity [21]. Publication bias was assessed using funnel plots and rank tests.
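The pooling steps described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the R code used in this study (which relied on the meta and metafor packages). The function names and the example values are hypothetical, the Knapp and Hartung adjustment is not implemented, and the standard error is back-calculated on the assumption that the reported 95% CI is symmetric on the analysis scale.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% CI


def se_from_ci(lower, upper):
    """Approximate a standard error from a symmetric 95% CI."""
    return (upper - lower) / (2 * Z_95)


def dersimonian_laird(estimates, std_errors):
    """Pool study estimates with the DerSimonian-Laird random-effects model."""
    k = len(estimates)
    variances = [se ** 2 for se in std_errors]
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci": (pooled - Z_95 * se_pooled, pooled + Z_95 * se_pooled),
        "tau2": tau2,
        "q": q,
        "i2": i2,
    }


def landis_koch(value):
    """Label a reliability coefficient using the cut-offs stated in the text."""
    for upper, label in [(0.20, "poor"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if value <= upper:
            return label
    return "almost perfect"


def heterogeneity_label(i2):
    """Label I² (in percent) using the cut-offs stated in the text."""
    if i2 < 25:
        return "low"
    return "moderate" if i2 <= 75 else "high"


# Hypothetical inputs for illustration only; not data from the included studies.
example_estimates = [0.95, 0.90, 0.88]
example_ses = [0.02, se_from_ci(0.84, 0.96), 0.04]
result = dersimonian_laird(example_estimates, example_ses)
print(result["pooled"], landis_koch(result["pooled"]),
      heterogeneity_label(result["i2"]))
```

Truncating the between-study variance at zero follows the usual DerSimonian-Laird convention; when Q falls below its degrees of freedom, the random-effects estimate coincides with the fixed-effect estimate.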

Meta-regression analyses were performed to explore the causes of study heterogeneity according to the following covariates: (a) subject enrollment (consecutive vs. selective), (b) country of study (Western vs. Eastern), (c) MRI magnetic field strength (3.0-T only vs. 1.5-T included), (d) MRI vendor (single vs. multiple), (e) MRI scanner (single vs. multiple), (f) number of readers (two readers vs. more than two readers), (g) difference in reader experience (all readers experienced in abdominal imaging vs. multiple readers including trainees), (h) average reader experience (≥ 9.6 years of experience in abdominal imaging vs. < 9.6 years, based on the mean reader experience of 9.6 years in the included studies), and (i) homogeneity in reader experience (i.e., a difference in reader experience of less than 3 years vs. other).
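A meta-regression on a single binary covariate, such as those listed above, can be sketched as a weighted least-squares fit. The sketch below is a simplified Python illustration, not the analysis code of this study: the function name is hypothetical, the between-study variance tau2 is supplied by the caller rather than estimated, and the p-value uses a normal approximation instead of the Knapp and Hartung adjustment.

```python
import math


def meta_regression_binary(estimates, std_errors, covariate, tau2=0.0):
    """Weighted least-squares meta-regression on one 0/1 covariate.

    tau2 is an assumed, caller-supplied between-study variance.
    """
    w = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, covariate)) / sum_w
    y_bar = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, covariate))
    if sxx == 0:
        raise ValueError("covariate does not vary across studies")
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, covariate, estimates))
    slope = sxy / sxx                # difference between the two subgroups
    intercept = y_bar - slope * x_bar  # estimate for the covariate = 0 group
    se_slope = math.sqrt(1.0 / sxx)
    z = slope / se_slope
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return {"intercept": intercept, "slope": slope,
            "se_slope": se_slope, "p_value": p_value}


# Hypothetical inputs for illustration only; not data from the included studies.
# covariate: 1 = consecutive enrollment, 0 = selective enrollment
fit = meta_regression_binary(
    estimates=[0.95, 0.95, 0.90, 0.90],
    std_errors=[0.02, 0.02, 0.02, 0.02],
    covariate=[1, 1, 0, 0],
)
print(fit["slope"], fit["p_value"])
```

With a 0/1 covariate, the slope is the difference in pooled reliability between the two subgroups, so a non-significant slope corresponds to a covariate that does not explain the heterogeneity.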

Results

Literature search

Initially, 125 studies were identified through a systematic literature search. After removing 19 duplicate articles, 73 were excluded upon reviewing their titles and abstracts during the screening. Subsequently, 27 articles were further excluded after full-text review. Finally, six original studies comprising data from a total of 1419 patients were included in this study [5, 8, 11,12,13,14]. The detailed study selection process is shown in Fig. 1.
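The counts reported above can be checked with simple arithmetic. The following short Python snippet uses only the numbers stated in the text; the variable names are illustrative.

```python
# Study selection counts as reported in the text
identified = 125
duplicates_removed = 19
excluded_title_abstract = 73
excluded_full_text = 27

screened = identified - duplicates_removed                  # records screened
full_text_reviewed = screened - excluded_title_abstract     # full texts assessed
included = full_text_reviewed - excluded_full_text          # studies included

print(screened, full_text_reviewed, included)
```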

Fig. 1 Flow diagram of the study selection process

Characteristics of the included studies

The detailed characteristics of the included studies are summarized in Table 1. All the included studies were retrospective cohort studies [5, 8, 11,12,13,14]. Four studies enrolled subjects consecutively [5, 8, 11, 13], and two studies selectively enrolled subjects who underwent surgery or biopsy [12, 14]. Three studies were performed in Western countries [5, 8, 11] and three in Eastern countries [12,13,14]. Four studies used a single MRI machine [5, 8, 11, 12], and two studies used two different machines [13, 14]. All included studies used a standard dose of gadoxetic acid (0.025 mmol/kg; Primovist/Eovist, Bayer) as the contrast agent [5, 8, 11,12,13,14]. Four studies used two readers [5, 12,13,14], while the remaining two used more than two readers [8, 11]. All included studies supplied information about the details of each reader’s experience [5, 8, 11,12,13,14]. The experience level of each reader varied, ranging from trainee to 20 years of experience in abdominal imaging, with an average of 9.6 years. The readers in all included studies were blinded to the clinical information [5, 8, 11,12,13,14].

Table 1 Characteristics of the included studies

Study quality

All included studies achieved a quality score of five or more out of the seven domains evaluated (Supplementary Table 2).

Meta-analytic pooled inter-reader reliability of functional liver imaging score

The meta-analytic pooled estimates of the inter-reader reliability of FLIS derived from gadoxetic acid-enhanced MRI are summarized in Table 2 and Fig. 2. The meta-analytic pooled inter-reader reliability of FLIS was 0.93 (95% CI, 0.88–0.98), indicating almost perfect inter-reader reliability. In addition, the meta-analytic pooled inter-reader reliability of the three FLIS subcategories was as follows: 0.93 (95% CI, 0.85–1.00) for EnQS, 0.95 (95% CI, 0.91–1.00) for ExQS, and 0.90 (95% CI, 0.81–0.99) for PVsQS, also indicating almost perfect inter-reader reliability.

Table 2 Summary of the meta-analysis
Fig. 2 Meta-analytic pooled inter-reader reliability. a Functional liver imaging score, b enhancement quality score, c excretion quality score, and d portal vein sign quality score

Meta-regression analysis

The meta-analytic pooled inter-reader reliability of FLIS showed moderate study heterogeneity (I² = 73.2%), below the threshold for high heterogeneity. According to the meta-regression analysis, the subject enrollment method, MRI-related factors (vendor, type of scanner, and magnetic field strength), number of readers, difference in reader experience, average reader experience, and homogeneity of reader experience were not significantly associated with study heterogeneity (Table 3).

Table 3 Meta-regression analysis of the functional liver imaging score

There was no significant publication bias regarding the inter-reader reliability of FLIS and its three subcategories (p > 0.44, Supplementary Fig. 1).

Discussion

This study demonstrated that FLIS and its three subcategories, enhancement quality score, excretion quality score, and portal vein sign quality score, derived from gadoxetic acid-enhanced MRI, had almost perfect inter-reader reliability, showing a meta-analytic pooled estimate of 0.90–0.95. Meta-analytic pooled inter-reader reliability of FLIS showed moderate study heterogeneity, but study methodology, MRI-related factors, and reader experience were not significantly associated with study heterogeneity.

In modern practice, MRI is widely used for diagnosis and follow-up in patients with chronic liver disease and cirrhosis. Under these circumstances, Bastati et al. introduced FLIS, a simple parameter derived from gadoxetic acid-enhanced MRI [5]. FLIS is directly associated with liver function and can predict the risk of liver-related complications or death [5, 8, 9]. These results suggest that FLIS is a promising imaging biomarker, and high reproducibility is essential for imaging biomarkers [10]. FLIS demonstrated almost perfect inter-reader reliability in this study, highlighting its reproducibility and robustness. The high reliability may have been associated with the simplicity and intuitiveness of FLIS as a scoring system. FLIS is a semi-quantitative parameter that does not require signal intensity measurements, complex equations, or specific software.

Bias among readers can cause changes in measurements because their subjectivity influences the test results [22]. It can result from differences in training, experience, and frames of reference between readers. However, in this meta-analysis, the difference in reader experience was not a significant factor in the inter-reader reliability of FLIS. All subgroups defined by covariates related to reader experience, namely, difference in reader experience (all experienced readers vs. multiple readers with trainees), average reader experience, and homogeneity in reader experience (homogeneous vs. heterogeneous), showed almost perfect inter-reader reliability. These results are consistent with previous studies in which there was no significant difference in inter-reader reliability between board-certified radiologists and trainees [8, 11, 13]. Considering the results of previous studies and this meta-analysis, FLIS is a reliable and reproducible grading system that can be used independently of the reader’s experience.

FLIS was developed as an alternative to other complex and quantitative methods for evaluating the hepatobiliary phase uptake of gadoxetic acid [2, 4]. Thus, FLIS is designed not to be affected by MRI-related factors and is a simple visual assessment of the relative signal intensity of the liver parenchyma and portal vein and the presence of biliary secretion of contrast agents. This meta-analysis also showed that MRI-related factors, including vendor, scanner type, and magnetic field strength, did not affect the interpretation of the FLIS, resulting in high inter-reader reliability.

This study had some limitations. First, we could not include an original study that did not report the standard error of inter-reader reliability [9]. Both the standard error and the ICC or κ from each study were needed to calculate meta-analytic pooled estimates of inter-reader reliability. Second, the meta-analytic pooled inter-reader reliability of FLIS showed moderate heterogeneity, and we could not identify its cause despite the meta-regression analysis. Nonetheless, because the inter-reader reliability of FLIS was almost perfect in each original article before synthesis, the moderate heterogeneity may not be a significant problem. However, further studies should be conducted to identify any factors affecting the inter-reader reliability of FLIS. Third, some included studies reported the inter-reader reliability of the FLIS subcategories only, without reporting that of FLIS itself. Therefore, we took the median inter-reader reliability of the reported subcategories as that of FLIS.

In conclusion, the meta-analytic pooled estimate of the inter-reader reliability of FLIS and its three subcategories showed almost perfect reliability. Therefore, FLIS may be a reliable imaging parameter that reflects liver function and outcomes in patients with chronic liver disease. Further studies should be performed to confirm any factors affecting the inter-reader reliability of FLIS.