Introduction

The incidence of hepatocellular carcinoma (HCC) has risen rapidly over the last few years, especially in the USA [1, 2].

In Sweden, a log-linear model used to estimate the incidence of hepatocellular carcinoma more accurately showed that hepatitis C-associated liver cancer had increased and constituted 20% of cases in 2010 [3].

HCC was the sixth most common cancer and the fourth leading cause of cancer-related deaths worldwide in 2018 [4].

The distribution of HCC varies according to geographic location, to viral hepatitis and to the age at which it was acquired. Chronic hepatitis B virus (HBV) and hepatitis C virus (HCV) infections represent the leading cause of HCC (60–70%). In most of Africa and Asia, HBV is the single leading risk factor for HCC, whereas in Japan, northern Europe and the USA, HCV is the major risk factor [5].

In 80–90% of cases, HCC occurs in the setting of cirrhosis [6]. However, a stepwise progression of hepatocarcinogenesis has been established [7], so when a hepatocellular nodule is detected, monitoring is recommended, in order to diagnose premalignant nodules and early HCC, when effective therapy can be applied.

Periodic serologic and imaging tests for patients known to be affected by chronic liver diseases are recommended and widely implemented in current clinical practice. However, the optimal surveillance interval and surveillance tools for HCC have not yet been standardized [8].

CT and MR are the imaging techniques that often allow a definite diagnosis of HCC without the need to biopsy the lesion. HCC is the only tumor that can be diagnosed with imaging alone, without histopathological confirmation [9].

MRI has been proposed as a sensitive (81%) and specific (85%) imaging modality for the evaluation of liver nodules in patients with cirrhosis [10, 11].

This is based on the unique properties of MR imaging resulting in a high intrinsic soft tissue contrast between normal liver parenchyma and liver lesions, which can be further enhanced with intravenous administration of non-specific (extracellular) and liver-specific (hepatobiliary) gadolinium-based contrast agents [12].

Several scientific organizations and societies have proposed diagnostic systems for the interpretation of liver examination, in order to reduce imaging interpretation variability and improve communication with clinicians [13].

The Liver Imaging Reporting and Data System (LI-RADS) scale, first launched in March 2011, has been adopted by many clinical practices [14]. It is a system for standardizing the performance, diagnostic interpretation and reporting of liver nodule examinations, using fixed criteria [15, 16]. Updates were released in 2013 and 2014 [14], and the latest update was released in 2018 [15].

A different scale of diagnostic interpretation, adopted in many fields of research, is the Likert scale [17, 18]. Our purpose was to compare the performance of readers using the LI-RADS scale with that of readers using a Likert scale, in a cohort of patients with chronic liver disease who had undergone MRI after a nodule was discovered at sonography, using the histopathologic results from biopsy/liver transplant or an MRI follow-up over 4 years as reference standards.

Materials and methods

Patients

This retrospective study was approved by our institutional review board. We reviewed patients with cirrhosis and no history of previous HCC who underwent an MR examination between February 2006 and March 2012 because of new nodules discovered with sonography. We identified 39 patients (M/F: 24/15; mean age, 73.1 years; age range, 54–91 years) with a total of 44 lesions.

For each patient, we registered the following data: age, date of the MR examination, number of nodules, segmental location of nodules, nodule size, radiologist evaluation for each nodule and the definitive diagnosis. This diagnosis was expressed as 0 to indicate “non-evidence of HCC” and as 1 for “histological confirmation of HCC”.

MRI technique

All MR examinations were performed on a 1.5-T system (Avanto; Siemens Medical Systems, Erlangen, Germany) using a phased-array coil for signal detection. All patients underwent axial T1-weighted and T2-weighted sequences and multiphasic contrast-enhanced dynamic sequences of the whole liver with fat suppression. T1-weighted imaging included a breath-hold in-phase gradient echo sequence (175/5 TR/TE, 256 × 112 matrix, 70° flip angle) and an out-of-phase gradient echo sequence (175/2.38 TR/TE, 256 × 112 matrix, 70° flip angle). T2-weighted imaging included non-fat-suppressed (3945/66 TR/TE, 320 × 195 matrix) and fat-suppressed sequences (4185/53 TR/TE, 320 × 184 matrix). Dynamic sequences were performed with a T1 three-dimensional volumetric breath-hold examination using the following parameters: 4.7/2.3 TR/TE, 256 × 134 matrix, 10° flip angle, 3-mm slice thickness. A gadolinium-based contrast agent (gadobenate dimeglumine; Multihance, Bracco, Milan, Italy) was injected at a dose of 0.2 mmol/kg at a rate of 2 mL/second, followed by 0.2 mL/kg of normal saline flush, with a second syringe (Medrad Stellant, Bayer, Germany) at the same injection rate. The arterial phase was acquired using a real-time bolus-tracking method: the liver was imaged 9 s after the contrast entered the celiac axis. Additionally, venous and delayed phases were obtained 60 and 180 s, respectively, after contrast administration. A breath-hold in-phase and out-of-phase T1-weighted sequence (175/2.3 TR/TE, 256 × 134 matrix) and a T1 three-dimensional volumetric breath-hold sequence were performed 2 h after contrast injection (hepatocyte phase).

Image interpretation

MRI studies were analyzed independently by four radiologists using Synapse (PACS, Fujifilm Medical Systems, Japan): two with 10 and 20 years of experience in liver MRI, and two with 1 and 3 months of experience in liver MRI. Two of these radiologists, with 1 month (IradioLIR) and 10 years (EradioLIR) of experience, evaluated the lesions using the LI-RADS scale v.2018, while the other two, with 3 months (IradioLik) and 20 years (EradioLik) of experience, evaluated the lesions using the Likert scale (scores 1–5). None of the radiologists was aware of the final diagnosis, official reports or clinical information; they knew only that each liver contained at least one lesion to characterize.

IradioLik gained her 3 months of liver MRI experience with EradioLik, without using the LI-RADS scale, while IradioLIR and EradioLIR had trained together for 1 month using the LI-RADS scale for MRI characterization of liver nodules.

LI-RADS scale v.2018

The final category of a liver nodule is determined by the following features: (1) the presence or absence of arterial phase hyper-enhancement; (2) the presence or absence of nonperipheral washout; (3) the presence or absence of an enhancing capsule; (4) the size of the nodule; and (5) the threshold growth of the lesion compared to previous examinations. Certain “ancillary features” and “tie-breaking” rules are applied to adjust category for a total of five categories [15, 16, 19, 20]: LR 1 = definitely benign, LR 2 = probably benign, LR 3 = intermediate probability for HCC, LR 4 = probably HCC, LR 5 = definitely HCC.

Likert scale

Nodular liver lesions were categorized as follows: 1 = HCC is highly unlikely to be present, 2 = HCC is unlikely to be present, 3 = the presence of HCC is equivocal, 4 = HCC is likely to be present and 5 = HCC is highly likely to be present.

According to international guidelines, the diagnosis of HCC was favored by the presence of these criteria: early post-contrast arterial enhancement and portal venous/delayed phase washout, corona enhancement, capsule appearance and threshold growth [13].

Reference standard

Histopathological findings (72.7% of lesions) and MRI follow-up over 4 years (27.3% of lesions) were used as reference standards.

Histopathology was obtained from biopsy (56.8% of lesions) or liver transplantation specimens (15.9% of lesions). At MRI follow-up, a lesion was considered benign when it showed stable or reduced diameters.

Statistical analysis

Before choosing the optimal scale and score for HCC detection, inter-reader agreement was evaluated by computing linear weighted k coefficients and Pearson correlation coefficients. These analyses allowed us to understand which scale was the most objective and consistent. The k coefficient was interpreted as indicating poor agreement when it was 0.40 or lower, moderate/substantial agreement when it was higher than 0.40 and lower than 0.80, and strong agreement when it was greater than 0.80. Pearson correlation coefficients were also employed to evaluate inter-reader agreement and to confirm the results obtained with the k coefficient: values lower than or equal to 0.30 indicated poor agreement, values higher than 0.30 and lower than or equal to 0.70 indicated moderate agreement, and values higher than 0.70 indicated strong inter-reader agreement.
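The agreement statistics described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function names are hypothetical, scores are assumed to be integers from 1 to 5, and the kappa is written in its disagreement-weight form with linear weights |i − j|/(k − 1). Because the text does not say how a k of exactly 0.80 is labeled, the sketch assigns that boundary value to the moderate/substantial band, which is an assumption.

```python
from math import sqrt


def linear_weighted_kappa(r1, r2, n_categories=5):
    """Cohen's kappa with linear weights for two readers scoring 1..n_categories."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("need two non-empty rating lists of equal length")
    n, k = len(r1), n_categories
    # Marginal proportion of each score for each reader.
    p1 = [sum(1 for a in r1 if a == c) / n for c in range(1, k + 1)]
    p2 = [sum(1 for b in r2 if b == c) / n for c in range(1, k + 1)]
    # Observed weighted disagreement: mean of |a - b| / (k - 1) over lesions.
    observed = sum(abs(a - b) for a, b in zip(r1, r2)) / (n * (k - 1))
    # Weighted disagreement expected if the readers scored independently.
    expected = sum(
        p1[i] * p2[j] * abs(i - j) / (k - 1)
        for i in range(k)
        for j in range(k)
    )
    if expected == 0:
        raise ValueError("kappa undefined: both readers used a single identical score")
    return 1.0 - observed / expected


def pearson_r(x, y):
    """Pearson correlation coefficient (undefined if either list is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def interpret_kappa(k):
    """Agreement bands from the text; k == 0.80 is placed here by assumption."""
    if k > 0.80:
        return "strong"
    if k > 0.40:
        return "moderate/substantial"
    return "poor"


def interpret_pearson(r):
    """Agreement bands from the text for the Pearson coefficient."""
    if r > 0.70:
        return "strong"
    if r > 0.30:
        return "moderate"
    return "poor"
```

Applied to the coefficients reported in the Results, `interpret_kappa(0.89)` gives "strong" for LI-RADS and `interpret_kappa(0.69)` gives "moderate/substantial" for Likert.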

For both the LI-RADS v.2018 and Likert scales, receiver operating characteristic (ROC) curve analysis was performed to identify, for each evaluator, the optimal score for HCC detection, defined as the score that maximized the evaluator's accuracy (ACC). Evaluations from each scale were classified as positive for HCC if the score was equal to or higher than the derived optimal score; otherwise, they were classified as negative for HCC. Analysis of the ROC curves allowed the comparison of the Likert and LI-RADS v.2018 methods in terms of accuracy, sensitivity and specificity for HCC detection. To this end, a logistic regression model (with each evaluator's scores, from 1 to 5, as the independent variable and the gold standard, 0 or 1, as the dependent variable) was used in each case to derive the probability of each lesion being HCC; the fitted data were used to compute the ROC curves. Furthermore, for both the LI-RADS and Likert scales, the scores from the two readers were pooled to derive a single optimal score between evaluators using the same scale. To simulate the use of pooled data, for each scale we employed a multiple regression model with the scores of both radiologists as independent variables and the gold standard as the dependent variable. The z test for the difference between two proportions was applied to check whether the performance values achieved by the readers of the same scale differed significantly. All p values were two-sided, and p < 0.05 was considered to indicate a significant difference. The statistical analysis software was coded in MATLAB.
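The single-reader part of this analysis can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' MATLAB implementation. It skips the logistic regression step and works on the raw 1–5 scores, relying on the fact that a one-predictor logistic model with a positive slope is a monotone transform of the score and therefore leaves the ROC curve and AUC unchanged. The function names and the example scores are hypothetical, and ties in accuracy are resolved in favor of the lowest threshold, which is also an assumption.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that an HCC lesion outscores a non-HCC lesion.

    Ties between an HCC and a non-HCC lesion count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one HCC and one non-HCC lesion")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def optimal_score(scores, labels, levels=(1, 2, 3, 4, 5)):
    """Return (threshold, accuracy) for the rule 'score >= threshold is HCC'.

    The first (lowest) threshold reaching the maximum accuracy is kept.
    """
    best = None
    for t in levels:
        correct = sum((s >= t) == (y == 1) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best


# Hypothetical scores for eight lesions (1 = HCC, 0 = non-HCC).
example_scores = [5, 4, 4, 3, 2, 1, 4, 2]
example_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(optimal_score(example_scores, example_labels))    # (3, 0.875)
print(auc_from_scores(example_scores, example_labels))  # 0.875
```

The pooled analysis, which combines both readers' scores in a regression model, is not reproduced here.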

Results

Since five patients had two lesions to characterize, we evaluated a total of 44 lesions, 26 HCC and 18 non-HCC.

Using the LI-RADS scale, 34 lesions (19 HCC + 15 non-HCC = 77.27% of all the 44 lesions) obtained the same score. The k coefficient between the two evaluators of LI-RADS scale was 0.89, while the estimated Pearson correlation coefficient equaled 0.90.

Using the Likert scale, 22 lesions (11 HCC + 11 non-HCC = 50% of all lesions) were classified with equal scores. The k coefficient and the Pearson correlation coefficient computed to evaluate the Likert scale inter-reader variability were much lower than those computed for the LI-RADS scale; they equaled, respectively, k = 0.69 and Pearson = 0.63.

The ROC curves computed to assess reader performance for HCC detection (Fig. 1) had the following area under the curves (AUCs):

Fig. 1 ROC curves for each evaluator and for pooled data, for both the LI-RADS and Likert scales

  • Using the LI-RADS scale, the AUCs were 0.87 (EradioLIR), 0.75 (IradioLIR), 0.91 (pooled data);

  • Using the Likert scale, the AUCs were 0.79 (EradioLik), 0.83 (IradioLik), 0.87 (pooled data).

When pooled data were not considered, ROC analysis showed that the optimal threshold criterion, which maximized both the accuracy and the average of sensitivity (SENS) and specificity (SPEC) in the detection of HCC, was a score of 4 or higher for all the evaluators, using either the LI-RADS or the Likert scale. At this optimal score, we calculated accuracy (ACC), sensitivity (SENS), specificity (SPEC), positive predictive value (PPV) and negative predictive value (NPV) (Table 1).
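As a sketch of how these five values follow from a reader's scores once the threshold is fixed, the Python function below builds the 2 × 2 confusion matrix for the rule "score of 4 or higher is positive for HCC". The function name and the example data are hypothetical; the study's per-lesion scores are not reproduced here.

```python
def diagnostic_metrics(scores, labels, threshold=4):
    """ACC, SENS, SPEC, PPV and NPV for the rule 'score >= threshold is HCC'.

    labels use the reference standard coding: 1 = HCC, 0 = non-HCC.
    """
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        positive = s >= threshold
        if positive and y == 1:
            tp += 1
        elif positive and y == 0:
            fp += 1
        elif not positive and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        # An empty denominator (e.g. no positive calls) leaves the value undefined.
        return num / den if den else float("nan")

    return {
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "SENS": ratio(tp, tp + fn),
        "SPEC": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }


# Hypothetical scores for eight lesions (1 = HCC, 0 = non-HCC).
print(diagnostic_metrics([5, 4, 4, 3, 2, 1, 4, 2], [1, 1, 1, 1, 0, 0, 0, 0]))
```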

Table 1 Performance values for each observer score

At the optimal score, all the evaluators and the pooled data also achieved the maximum average of SENS and SPEC.

For both scales, reader performance improved when a cooperative diagnostic procedure was simulated by employing pooled data. Specifically, for the LI-RADS and Likert scales, the optimal pooled scores, which are composed of the Eradio and Iradio scores, were, respectively, equal to (3, 2) and (2, 4) (Table 1).

To compare the two scales in terms of radiologist performance, for each scale we computed the mean of the performance values (ACC, SENS, SPEC, PPV and NPV at the optimal scores, and AUC) achieved by the two radiologists and the pooled data.

The computed averages were similar, and the z test did not find any statistically significant difference between them. For the LI-RADS scale, we obtained: ACC = 0.80, SENS = 0.72, SPEC = 0.93, PPV = 0.93, NPV = 0.70, AUC = 0.85; for the Likert scale, the results were: ACC = 0.79, SENS = 0.73, SPEC = 0.87, PPV = 0.89, NPV = 0.70, AUC = 0.83.
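For readers unfamiliar with the test, a two-sided z test for the difference between two proportions can be sketched in Python as below. The text does not state which variant was used, so the pooled standard error here is an assumption; the function name and the counts in the usage lines are hypothetical and are not study data.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test for H0: p1 == p2, using the pooled standard error.

    x1 and x2 are success counts (e.g. correctly classified lesions) out of
    n1 and n2 evaluations. Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("test undefined when the pooled proportion is 0 or 1")
    z = (p1 - p2) / se
    # Two-sided p value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts: identical proportions give z = 0 and p = 1.
print(two_proportion_z_test(35, 44, 35, 44))
# Hypothetical counts with a large gap between the two proportions.
z, p = two_proportion_z_test(40, 44, 20, 44)
print(p < 0.05)
```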

The comparability of the achieved results suggests that neither scale is guaranteed to achieve a better performance than the other.

Discussion

MR is a noninvasive diagnostic modality with a high sensitivity (81%) and specificity (85%) for the detection and evaluation of liver nodules in patients with high risk of HCC [10]. When nodules are larger than 15 mm, the specificity becomes higher (100%) [21]. These characteristics make MRI relevant for the management of patients with suspicion of HCC, mostly for the detection of early HCC, usually composed of well-differentiated hepatocytes [22], thereby avoiding more invasive examinations such as fine-needle biopsy (FNB) samples [9, 21].

However, these results are influenced by the experience of the radiologists who examine the images and some clinicians complain of the lack of standardization in reporting liver MRI nodules.

In an effort to improve consistency of interpretations among radiologists, the introduction of a standardized scheme with fixed criteria, already applied for other organs such as breast, thyroid and prostate, has also been proposed for the liver.

The LI-RADS scale was created to standardize reporting and data collection in patients with cirrhosis or other risk factors for developing HCC, for both CT and MRI, stratifying the risk of malignancy of a nodule on a scale of 1–5 (LR 1–LR 5) [13, 14, 15, 16]. In particular, nodular arterial phase hyper-enhancement is a very important feature: a nodule showing it that measures 2 cm or more, or that shows at least one of three other major features (nonperipheral “washout”, an enhancing “capsule” or threshold growth), is to be considered either probably HCC (LR 4) or definitely HCC (LR 5) [15, 16, 23].

The Likert scale is a quicker and easier method, based on an analysis of items. In our study, we used five items (scores 1–5) to express a positive or a negative opinion on the probability of a liver nodule being HCC, based on dynamic enhancement and other features such as corona enhancement, capsule appearance and threshold growth [13].

To study the performance of the LI-RADS scale, we compared the diagnostic performance of readers using the latest version of the LI-RADS scale, updated in 2018, with that of readers using the Likert system.

With this aim, gastroenterologists provided us with a homogeneous cohort of patients with chronic liver disease. They selected only patients with no previous HCC in whom a nodule was first discovered at sonography and subsequently studied with MRI.

In both scales, radiologists performed well and accuracy was high.

Visual analysis of the ROC curves computed to assess reader performance for HCC detection would suggest that the LI-RADS scale provides slightly better performance values, since the mean AUC achieved by readers employing the LI-RADS scale is 0.85, while that achieved by readers employing the Likert scale is 0.83. Moreover, pooling the data of readers employing the LI-RADS scale yields an AUC of 0.91, while pooling the data of readers employing the Likert scale yields an AUC of 0.87.

A similar study was completed by Zhang et al. [24], using an earlier LI-RADS version (v2014).

These authors obtained substantial variations in liver observations between reporting by the LI-RADS and Likert methods. Particularly, there was only a slight agreement between the two methods in classification of probable HCC.

Zhang et al. demonstrated differences in diagnostic accuracy between LI-RADS and Likert (accuracy, 78.6% and 87.2%, p < 0.001), but did not evaluate potential differences in inter-reader agreement.

Moreover, they showed substantial discordance between computed tomography (CT) and magnetic resonance (MR) in stratification of hepatic nodules using LI-RADS.

We studied a more homogeneous cohort of patients than Zhang et al. did, because they searched their database for liver CT or MRI reports of patients suspected of having a hepatic tumor, whereas we evaluated only subjects who underwent MRI for a nodule detected with sonography.

MRI has a higher per-lesion sensitivity than CT (80 vs. 68%) with a specificity of 94% and should be used as the preferred imaging modality for HCC diagnosis in patients at high risk [25].

MRI performed with a hepatobiliary agent has a significantly greater per-lesion sensitivity (87 vs. 74%) compared with MRI performed with an extracellular agent [26].

Choi et al. [27] demonstrated that LI-RADS v2014 LR-5 on gadoxetate disodium-enhanced MRI exhibited an excellent PPV for the diagnosis of HCC in patients with chronic liver disease and for some authors, it should be incorporated into LI-RADS as a major feature [9, 28, 29].

When we evaluated readers using the same scale, for both the LI-RADS v.2018 and Likert scales, we found that the differences between the SENS, SPEC and NPV values were statistically significant. When we compared the performance achieved with the two different scales, however, we did not find any statistically significant difference between the aforementioned average performances, in line with the results of Barth et al. [30].

The two radiologists who used the LI-RADS v.2018 scale assigned the same score in 73.08% of HCC lesions (19 of 26) and 83.33% of non-HCC lesions (15 of 18). With the Likert scale, the same score was assigned in 42.31% of HCC lesions (11 of 26) and in 61.11% of non-HCC lesions (11 of 18).
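These percentages follow directly from the lesion counts given in the Results (26 HCC and 18 non-HCC lesions; identical scores for 19 + 15 lesions with LI-RADS and 11 + 11 with Likert). The short Python check below recomputes them, rounded to two decimals, so the last digit can differ from a truncated value. The variable and function names are illustrative only.

```python
# Lesion counts taken from the Results section.
TOTALS = {"HCC": 26, "non-HCC": 18}
SAME_SCORE = {
    "LI-RADS": {"HCC": 19, "non-HCC": 15},
    "Likert": {"HCC": 11, "non-HCC": 11},
}


def percent(part, whole):
    """Percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


def agreement_table():
    """Percentage of lesions given the same score by both readers of a scale."""
    table = {}
    for scale, counts in SAME_SCORE.items():
        row = {group: percent(counts[group], TOTALS[group]) for group in TOTALS}
        row["all"] = percent(sum(counts.values()), sum(TOTALS.values()))
        table[scale] = row
    return table


for scale, row in agreement_table().items():
    print(scale, row)
# LI-RADS {'HCC': 73.08, 'non-HCC': 83.33, 'all': 77.27}
# Likert {'HCC': 42.31, 'non-HCC': 61.11, 'all': 50.0}
```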

This resulted in strong agreement between the two evaluators using the LI-RADS v.2018 scale (k coefficient = 0.89), whereas there was only moderate agreement between the two evaluators using the Likert scale (k coefficient = 0.69). These values, as stated by Barth et al. [30], were also confirmed by Pearson’s correlation coefficient (LI-RADS v.2018 = 0.90; Likert = 0.63). These results suggest that the use of the LI-RADS v.2018 scale for HCC recognition yields lower inter-reader variability than the Likert scale.

Ren et al. [26] compared the diagnostic performance of LI-RADS on MR for diagnosing HCC between v.2017 and v2018, but they did not calculate the inter-reader agreement between the observers as we did to show the reproducibility of the method.

According to Petruzzi et al. [31], the LI-RADS scale produces strong reliability and validity, while aiming to improve the clarity of clinical MRI reports. Our findings confirmed those of Petruzzi et al.

A primary goal and motivation for the development of the LI-RADS scale is to reduce the inter-reader variability.

Clinicians have expressed serious concern over the inconsistency in liver lesion reporting among radiologists owing to the reader’s experience level, differences in interpretation as well as differing personal preferences [32].

In our study, the LI-RADS scale has demonstrated the potential to reduce inter-reader variability and to enhance the communication with referring clinicians, as also observed by other authors [33].

There are several limitations in our study. A primary limitation is that this is a retrospective study with a small sample size of lesions. Second, the radiologists are from the same hospital and may have similar perspectives in terms of interpretation, and third, the experience of the radiologists involved in the study varied greatly. Fourth, we used a mixed reference standard of pathology and MRI follow-up over 4 years. Finally, as a contrast medium, for the evaluation of wash-in and washout and the hepatobiliary phase, we used only gadobenate dimeglumine and not gadoxetate disodium.

Conclusions

According to our results, there was not a statistically significant difference between LI-RADS v.2018 scale and Likert scale in the evaluation of liver nodules and detection of HCC; nevertheless, we suggest the use of LI-RADS v.2018 scale, because it appears more objective and consistent.

Our results are consistent with the goal of LI-RADS, which is to improve the consistency of radiology reports in the imaging of patients at high risk for HCC.

Additional studies are warranted to evaluate the performance of these scales across radiologists from different institutions with different levels of experience.