Hepatocellular carcinoma (HCC) is the most common primary malignant tumor of the liver, the fifth and seventh most common cancer in men and women, respectively. It is the second leading cause of deaths related to cancer worldwide [1].

Unlike many other cancers, under certain conditions HCC can be diagnosed with imaging, without the need for pathology confirmation in many cases. Several diagnostic systems advocate using contrast-enhanced ultrasound, computed tomography (CT) or magnetic resonance imaging (MRI) for the non-invasive diagnosis of HCC [2,3,4,5]. The Liver Imaging Reporting and Data System (LI-RADS) has been developed for the standardized interpretation, reporting, and data collection of liver imaging examinations in patients at risk for HCC [3, 6]. LI-RADS assigns categories to liver observations that reflect their probability of HCC. The development of LI-RADS is based on a combination of evidence, expert opinion, and the desire to maintain congruency with American Association for the Study of Liver Diseases (AASLD) and United Network for Organ Sharing-Organ Procurement and Transplantation Network (UNOS-OPTN) guidelines [4, 7, 8]. However, there is a need to validate the diagnostic performance of LI-RADS categories, major and ancillary imaging features. Prior studies have focused on the reproducibility of LI-RADS categories and major features [9,10,11]. Some studies have reported the diagnostic performance of LI-RADS categories [10,11,12,13], major features [12, 13], and ancillary features [11,12,13] in earlier versions (2011 or 2014).

LI-RADS was updated in 2017 based on evidence from recent studies [3], expert consensus, and multidisciplinary feedback. The 2017 version includes revised categories, explicit criteria for LR-M, updated definitions, and clarifications on the use of ancillary features. LI-RADS is presently the only diagnostic system that allows the use of ancillary features to adjust the probability of HCC. Some of these ancillary features favor malignancy in general, some favor HCC in particular, and others favor benignity. Radiologists can choose to apply ancillary features to upgrade or downgrade the LI-RADS category initially assigned based on major features only. CT, MRI, and contrast-enhanced ultrasound may be used for diagnosis of HCC and categorization of liver observations using major and ancillary features. Compared to MRI, CT is widely available, less prone to artifacts, less time consuming, less costly, and requires less experience for interpretation [14]. Its high spatial resolution and isotropic acquisitions are suitable for multiplanar reconstructions [15]. For evaluation of focal liver lesions, CT has a sensitivity and specificity similar to those of MRI according to recent meta-analyses [16, 17].

Therefore, the primary purpose of this study was to evaluate the diagnostic performance of LI-RADS version 2017 major features, ancillary features, and categories on CT for the diagnosis of HCC. The secondary purpose was to evaluate the interreader agreement.

Materials and methods

Study design and subjects

This retrospective, cross-sectional, single-site study was approved by the institutional review board at the Centre Hospitalier de l’Université de Montréal (CHUM). Patient consent was waived. The study flowchart is illustrated in Fig. 1. All eligible patients were identified using the inclusion and exclusion criteria listed in Table 1.

Fig. 1

Flowchart of patient selection. HCC hepatocellular carcinoma

Table 1 Inclusion and exclusion criteria

Characteristics of patients and observations

Between January 2013 and September 2016, 59 patients (104 observations including 72 HCCs) met the inclusion criteria. Characteristics of patients, observations, LR categories, final diagnosis, and time interval between index CT imaging and reference standards are summarized in Table 2. The mean diameter of all observations was 30.1 mm ± 23.4 (range, 6–110 mm), and the mean diameter of the HCCs was 34.4 mm ± 23.4 (range, 9–110 mm). The average number of observations per patient was 1.8 ± 1.3 (range, 1–10).

Table 2 Characteristics of patients, observations, reference standards, and index tests included in study cohort

CT imaging protocols

All CT scans were performed with the multidetector CT (MDCT) scanners available at our institution (16 and 64 MDCT scanners). An unenhanced phase, which is optional according to the LI-RADS technical recommendations, was performed in some patients. All CT examinations included at least three vascular phases covering the upper abdomen. Details of the CT protocol are provided in Supplementary Table 1.

Observation registry

The CT examination performed for characterization of one or more observations was chosen as the index test. One reader (A.A., 9 years of experience in radiology) bookmarked the observations, reported the diameter and the location according to the Couinaud classification of liver segments for each observation (1–10 per patient), and provided a list for review. The previous examination (ultrasound, CT, or MRI) was used to evaluate visibility as a discrete nodule and growth, when applicable.

CT review

Two senior abdominal radiologists with experience in liver imaging (D.O. and J.S.B., with 21 and 23 years of experience, respectively) independently reviewed all the CT examinations on picture archiving and communication system workstations (Impax version 6.6, Agfa HealthCare, Mortsel, Belgium) using the size and localization of observations as a guide. There were no annotations or measurements on images at the time of review. Information about observations’ growth or ultrasound visibility as discrete nodule was provided to the reviewers at the time of review. Assessment of major and ancillary features was conducted according to the LI-RADS version 2017 [3]. Images were bookmarked by the author and used as a guide.

Blinding

Both readers were aware of the design and purpose of the study, but were blinded to the clinical history, imaging report, and final diagnosis.

Image analysis

Each reader independently assessed each major and ancillary feature for all bookmarked observations, except for diameter, threshold growth, ultrasound visibility as discrete nodule, and subthreshold growth, which were assessed by the reader (A.A., 9 years of experience in radiology) who bookmarked all observations.

Each reader separately assigned a LI-RADS category that reflects the probability of HCC: LR-NC for observations that cannot be categorized due to image degradation or omission, LR-1 for definitely benign, LR-2 for probably benign, LR-3 for intermediate probability of HCC, LR-4 for probably HCC, LR-5 for definitely HCC, LR-TIV for tumor in vein, and LR-M for probably or definitely malignant but not specific for HCC [3]. Categorization was performed in a two-step process: first, using LI-RADS major imaging features only; then, after evaluation and application of ancillary features. When a feature could not be assessed (for example, blood products in mass cannot be evaluated in a patient without an unenhanced phase), it was considered non-applicable. Each reader was given the choice to adjust (upgrade or downgrade) the final category for each observation based on the presence of ancillary features. As defined by LI-RADS, readers were not allowed to upgrade from LR-4 to LR-5. Disagreements on major features, ancillary features, and LI-RADS categories were resolved by consensus at least 2 weeks after the initial interpretation.
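The second step of the categorization can be sketched in a few lines of Python. This is only an illustration of the adjustment rule, not the readers' actual workflow: the names `ORDERED` and `adjust_category` are hypothetical, and the one-category step size is an assumption about the LI-RADS v2017 application rules. The only constraint taken directly from the text above is that an LR-4 observation cannot be upgraded to LR-5.

```python
# Hypothetical sketch of the optional ancillary-feature adjustment.
# Assumption: adjustments move by one category at a time.
ORDERED = ["LR-1", "LR-2", "LR-3", "LR-4", "LR-5"]


def adjust_category(initial, direction):
    """Return the category after an optional adjustment.

    direction is "up", "down", or None (the reader chooses not to adjust).
    Categories outside ORDERED (LR-NC, LR-TIV, LR-M) are returned unchanged.
    """
    if initial not in ORDERED or direction is None:
        return initial
    i = ORDERED.index(initial)
    if direction == "up":
        # Upgrades are capped at LR-4: LR-4 to LR-5 is not permitted.
        if i >= ORDERED.index("LR-4"):
            return initial
        return ORDERED[i + 1]
    if direction == "down":
        # LR-1 has no lower category, so it stays LR-1.
        return ORDERED[max(i - 1, 0)]
    raise ValueError("direction must be 'up', 'down', or None")
```

For example, `adjust_category("LR-3", "up")` returns `"LR-4"`, whereas `adjust_category("LR-4", "up")` leaves the category at `"LR-4"`.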

Reference standard

The unit of analysis was per lesion. A composite reference standard was established based on (a) histopathological confirmation for 43 observations, including image-guided biopsy specimen (n = 16), hepatectomy (n = 14), and explant (n = 13); (b) imaging for 61 observations, including follow-up imaging demonstrating significant growth (≥ 50% diameter increase in less than 6 months or ≥ 100% diameter increase in more than 6 months), recurrence after local treatment or new lesion demonstrated by MRI (n = 52), CT (n = 3), contrast-enhanced ultrasound (CEUS) (n = 2), MRI demonstrating typical features of HCC, or ethiodized oil (lipiodol; Lipiodol Ultra-fluid, Guerbet, Villepinte, France) fixation within the tumor (n = 4). The assessor of the reference standard (A.A., 9 years of experience in radiology) had access to clinical history, index and previous imaging, and pathological information but did not participate in the assessment of the index test. Since non-malignant lesions are rarely assessed by pathological analysis, the reference standards for LR-1 and LR-2 observations were based on typical imaging findings on another imaging modality or the absence of progression to a higher category (LR-4, LR-5) during follow-up. The reference standard for LR-3 observations was follow-up imaging except when pathology was available.
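The growth criterion used in the imaging reference standard can be written as a small check. The sketch below is illustrative only: the function name `meets_growth_criterion` and its parameters are hypothetical, and the handling of an interval of exactly 6 months (treated here under the ≥ 100% rule) is an assumption, since the text specifies only "less than" and "more than" 6 months.

```python
def meets_growth_criterion(baseline_mm, followup_mm, interval_months):
    """Check the significant-growth rule described in the reference standard.

    Growth is significant if the diameter increases by at least 50% in less
    than 6 months, or by at least 100% over a longer interval.
    Assumption: an interval of exactly 6 months uses the 100% threshold.
    """
    if baseline_mm <= 0 or interval_months <= 0:
        raise ValueError("baseline diameter and interval must be positive")
    relative_increase = (followup_mm - baseline_mm) / baseline_mm
    if interval_months < 6:
        return relative_increase >= 0.5
    return relative_increase >= 1.0
```

With these definitions, an observation growing from 10 mm to 15 mm in 4 months meets the criterion, whereas growth from 10 mm to 18 mm over 8 months does not.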

Statistical analysis

Statistical analyses were supervised by a biostatistician (P.B.C., 7 years of experience) and performed by a data scientist (K.B., 8 years of experience) with statistical software (SAS version 9.4; SAS Institute Inc., Cary, NC). Continuous variables were expressed as mean ± standard deviation (SD), minimum and maximum. Categorical variables were expressed as numbers and percentages.

Diagnostic performance of imaging features.—Estimates of diagnostic performance (including sensitivity, specificity, accuracy, positive predictive value, and negative predictive value) of major and ancillary features were calculated separately for each reader (R1 and R2) and also for the consensus interpretation. A generalized estimating equation (GEE) model based on a binomial distribution with an identity link function and an exchangeable correlation matrix was used to compute estimates of diagnostic performance while taking into account the patient cluster effect due to the presence of several observations in the same patient. Diagnostic odds ratios were computed by using a GEE model based on a binomial distribution with a logit link function and an exchangeable correlation matrix. The point estimates and 95% confidence intervals (CI) were reported.
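For readers less familiar with these measures, the sketch below shows how each one is derived from a 2 × 2 table of feature presence against final diagnosis. It is a deliberately naive illustration: it pools all observations and ignores the patient cluster effect that the GEE models above account for, so it would not reproduce the reported estimates or confidence intervals. The function name `diagnostic_metrics` and the example counts are hypothetical and are not taken from the study tables.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Naive per-lesion estimates from a 2x2 table (no cluster adjustment)."""

    def ratio(numerator, denominator):
        # Return None when the denominator is zero (estimate undefined).
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "diagnostic_odds_ratio": ratio(tp * tn, fp * fn),
    }


# Hypothetical counts for illustration only.
example = diagnostic_metrics(tp=40, fp=5, fn=10, tn=45)
```

Returning `None` for an empty denominator mirrors the situation described later, in which performance could not be computed for rarely observed features.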

Diagnostic performance of LI-RADS categories.—The diagnostic performance of LI-RADS categories was reported for each category before and after applying ancillary features. The diagnostic performance was also reported for the following combinations (LR-5 + LR-TIV) and (LR-4, LR-5 + LR-TIV) before and after applying ancillary features. The diagnostic performance was calculated for each reader separately (R1 and R2) and also for the consensus interpretation. Similarly, GEE models were used to account for patient cluster effect during computation of point estimates and 95% CI.

Impact of ancillary features.—The number of observations in which ancillary features upgraded or downgraded the final LI-RADS category was documented.

Interreader agreement.—Cohen’s kappa with 95% confidence intervals was used to determine interreader agreement. The 95% bias-corrected bootstrap confidence intervals using 1000 bootstrap samples were computed. Agreement was interpreted as slight (0.01–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), or almost perfect (0.81–1.00) [18]. Considering the observed marginal distribution, the inter-reader agreement for major features, ancillary features, LI-RADS categories according to major features only and combination of major and ancillary features were documented also with proportion of agreement. Inter-reader agreement could not be computed for some features because of their low frequency.
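The agreement statistics can be illustrated with a short, self-contained sketch. It computes Cohen's kappa for two readers, maps the value onto the interpretation bands listed above, and derives a bootstrap confidence interval. Two simplifications are assumed and should be kept in mind: the interval is a plain percentile bootstrap rather than the bias-corrected bootstrap used in the study, and observations are resampled individually without regard to patient clusters. All function names are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(r1, r2):
    """Cohen's kappa for two equal-length lists of ratings."""
    n = len(r1)
    if n == 0 or n != len(r2):
        raise ValueError("ratings must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if expected == 1.0:
        # Both readers used one identical label throughout: kappa undefined.
        return None
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Map kappa onto the bands quoted in the text."""
    bands = [(0.81, "almost perfect"), (0.61, "substantial"),
             (0.41, "moderate"), (0.21, "fair"), (0.01, "slight")]
    for lower, label in bands:
        if kappa >= lower:
            return label
    return "below the listed bands"


def bootstrap_ci(r1, r2, n_boot=1000, seed=0):
    """Percentile 95% CI (simplification; the study used bias correction)."""
    rng = random.Random(seed)
    pairs = list(zip(r1, r2))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        k = cohen_kappa(*zip(*sample))
        if k is not None:  # skip resamples where kappa is undefined
            stats.append(k)
    if not stats:
        return None
    stats.sort()
    return stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats)) - 1]
```

The `None` return in `cohen_kappa` corresponds to the case in which kappa cannot be computed because both readers gave the same rating to every observation.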

Results

Diagnostic performance of major features

Table 3 summarizes the per-lesion diagnostic performance of major features. Sensitivity of major features for HCC ranged from 20.7% to 86.0% for consensus. Diameter threshold had the highest sensitivity of 98.7% followed by arterial phase hyperenhancement. Capsule appearance had the lowest sensitivity. Specificity ranged from 39.9% to 89.9% and was highest for enhancing capsule. Observation size more than 10 mm had the lowest specificity of 1.3%. Positive predictive value ranged from 50.0% to 90.5% and was highest for wash-out. Figure 2 illustrates an example of HCC diagnosed based on major features.

Table 3 Per-lesion diagnostic performance of major features according to LI-RADS v2017 algorithm
Fig. 2

Axial CT images in an 81-year-old man with cirrhosis and an 11-cm liver mass in segments II and III. Images obtained (A) before and after contrast enhancement in (B) late arterial phase show arterial phase hyperenhancement (arrow) of the observation; (C) portal venous and (D) delayed phases show washout appearance (arrow) and mosaic architecture (*) of the mass. The observation was categorized LR-5 according to major features, and the diagnosis of hepatocellular carcinoma was confirmed by pathology after hepatectomy

Diagnostic performance of ancillary features

Table 4 summarizes the per-lesion diagnostic performance of ancillary features favoring malignancy in general or HCC in particular. Sensitivity ranged from 1.3% to 50.8% and was highest for subthreshold growth. Specificity ranged from 66.9% to 96.4% and was highest for fat in mass, more than adjacent liver. Positive predictive value ranged from 78.4% to 90.3% and was also highest for fat in mass, more than adjacent liver. Figure 3 illustrates an example of an HCC diagnosed based on major features in combination with ancillary features.

Table 4 Per-lesion diagnostic performance of ancillary features in favor of malignancy in general and HCC in particular
Fig. 3

Axial CT and MR images in a 76-year-old man with cirrhosis and an 18-mm liver observation in segment VI. Axial CT images obtained (A) before and after contrast administration in (B) late arterial, (C) portal venous, and (D) delayed phases show intra-lesional fat (arrowhead) and subtle washout appearance at the posterior part of the mass (arrow). The observation was categorized LR-3 according to major features and upgraded to LR-4 based on the presence of fat in mass, more than adjacent liver, which is an ancillary feature in favor of malignancy. Axial MRI gradient-recalled echo (E) in-phase and (F) out-of-phase images confirmed fat content by signal drop on out-of-phase imaging (arrowheads). The diagnosis of hepatocellular carcinoma was confirmed by pathology after biopsy

Diagnostic performance of LI-RADS categories

The diagnostic performances of LI-RADS categories before and after applying ancillary features are summarized in Supplementary Tables 2 and 3. The sensitivity of LR-5 or LR-TIV was 65.5%. The sensitivity of the combination of LR-4, LR-5, or LR-TIV was 94.7% with major features only and 96% according to major and ancillary features. The specificity of LR-5 or LR-TIV was 91.3%. The specificity of LR-4, LR-5, or LR-TIV was 75.2% with major features only and 71.3% according to major and ancillary features. Figure 4 illustrates the proportion of HCC for each LI-RADS category.

Fig. 4

Graphic illustrating the proportion of hepatocellular carcinomas for each LI-RADS category. HCC hepatocellular carcinoma. N numerator, D denominator

Ancillary features impact on observation categories

The numbers and percentages of observation categories before and after application of ancillary features are illustrated in Figure 5. The observation category before and after applying ancillary features was identical in 96.2% of observations (100 of 104). When taking into account the rules determined by LI-RADS v2017, ancillary features modified the category in 3.8% of all observations (4 of 104). The observation category was upgraded in 8.7% (2 of 23 observations eligible for category upgrade) and downgraded in 4.1% (2 of 49 observations eligible for category downgrade). The two observations categorized LR-3 based on major features and upgraded to LR-4 based on ancillary features were a 10-mm HCC with subthreshold growth and a 16-mm hemangioma with suspicion of intra-lesional fat. The two observations categorized LR-3 based on major features and downgraded to LR-2 based on ancillary features were two 11-mm hemangiomas that had enhancement that paralleled blood pool.

Fig. 5

Diagram shows numbers and percentages of observations before and after application of ancillary features according to Liver Imaging Reporting and Data System (LI-RADS) version 2017 application rules. Color coding of each LI-RADS category is identical to the color convention used in official LI-RADS documents. N/A indicates category changes that are not permitted by LI-RADS version 2017 rules [3]

Out of the 27 observations categorized LR-4 based on major features, 13 cases could not be upgraded to LR-5 (to respect the LI-RADS rule that a LR-4 observation cannot be upgraded to LR-5) and remained LR-4 despite the presence of subthreshold growth (7), US visibility as discrete nodule (9), corona enhancement (1), fat sparing in solid mass (1), and fat in mass, more than liver (2).

Inter-reader agreement

Supplementary Table 4 summarizes interreader kappa and observed agreement for major features, ancillary features, and LI-RADS categories. Agreement was not calculated for diameter, threshold growth, ultrasound visibility as discrete nodule, and subthreshold growth because those were bookmarked by the author and provided to both readers during reading sessions. Kappa and observed agreement could not be calculated for fat sparing in solid mass, non-enhancing capsule, nodule-in-nodule, and blood product in mass because of perfect agreement.

Inter-reader agreement for major features was almost perfect for arterial phase hyperenhancement (κ = 0.90, proportion of observed agreement = 0.97) and washout (κ = 0.93, proportion of observed agreement = 0.97). Agreement was substantial for capsule (κ = 0.66, proportion of observed agreement = 0.89). Inter-reader agreement for ancillary features was substantial to almost perfect (κ = 0.66–0.85, proportion of observed agreement = 0.93–0.99). Inter-reader agreement was almost perfect for LI-RADS categories assigned based on major features only (κ = 0.88, proportion of observed agreement = 0.91), and when assigned according to major features in combination with ancillary features (κ = 0.90, proportion of observed agreement = 0.92).

Discussion

For major features, we confirmed that larger observation diameter is a predictor for HCC diagnosis as described in prior studies with different size stratification thresholds [11, 19, 20]. However, the specificity for observation diameter ≥ 10 mm and ≥ 20 mm was lower than the pooled specificity of 86%–90% for observation diameter 10–20 mm and diameter > 20 mm reported in the meta-analysis by Chou et al [16]. Although the high sensitivity of arterial phase hyperenhancement was within the range previously reported in the literature (65%–87%) [21,22,23], the specificity was lower than previously reported (81%) on CT by Sangiovanni et al [23] for ≤ 1 cm observations detected during ultrasound surveillance in patients with cirrhosis. This may be explained by the inclusion of non-ultrasound-detected observations in our study. Washout appearance as a standalone feature had a higher sensitivity but lower specificity than previously reported [23]. The combination of arterial phase hyperenhancement and washout, a criterion used in other systems for the diagnosis of HCC [5, 8, 24], provided a specificity within the range (66%–100%) reported in the literature [23, 25,26,27]. In our cohort, enhancing capsule had the highest specificity for the diagnosis of HCC, within the range (86%–96%) reported in previous MRI studies [28, 29].

Few studies have assessed the accuracy of ancillary features on CT, alone or in combination with major features, for diagnosis of HCC [13]. Among ancillary features favoring malignancy in general or HCC in particular, subthreshold growth and ultrasound visibility as discrete nodule had the highest sensitivity, whereas fat in mass more than adjacent liver had the highest specificity. Diagnostic performance could not be computed for corona enhancement, fat sparing in solid mass, non-enhancing capsule, nodule-in-nodule, mosaic architecture and blood product in mass because these features were rarely or never seen in our cohort. This is in line with a recent observation by Fowler et al that ancillary features were less commonly observed on CT than on MRI (26% and 74% respectively) [9]. Ultrasound visibility as discrete nodule, which is an ancillary feature added in the updated version of LI-RADS v2017, was the most frequently encountered feature with 27 observations seen on US, 24 of which were HCCs. In contrast, non-enhancing capsule which is also a new ancillary feature favoring HCC in particular in the 2017 version of LI-RADS was never encountered in our study. A subtype of non-enhancing capsule has been described in the literature as a small hypointense rim on hepato-biliary phase MRI but seldom described on CT [30,31,32].

The present study confirms the high specificity and positive predictive value of LI-RADS categories LR-5 and LR-TIV for the diagnosis of HCC. Our results are similar to those reported by Ronot et al who reported a specificity of 88.3% and a positive predictive value of 90.9% for the LR-5 and LR-5V combination based on LI-RADS v2014 [13]. Combining the LR-4, LR-5, and LR-TIV categories increases the sensitivity but with a decrease in specificity. The combination of (LR-4, LR-5 and LR-TIV) categories for the diagnosis of HCC remains controversial in the literature. Ronot et al. reported an increase in sensitivity (from 69.6% to 85.8%), and a significant decrease in specificity (from 88.3% to 66.0%) similar to our results. Others reported a marked increase in sensitivity with no loss in specificity, but those were limited to MRI [33, 34]. Application of ancillary features provided similar sensitivity and specificity for individual categories (e.g., LR-3 or LR-4) or combinations of categories (e.g., LR-4, LR-5, and LR-TIV). This may be explained by the LI-RADS v2017 application rules that specify that category upgrade or downgrade is optional, at the discretion of radiologists.

For major features, inter-reader agreement was substantial to almost perfect, highest for washout followed by arterial phase hyperenhancement, both higher than previously reported (κ = 0.61–0.84 and 0.54–0.86, respectively) in the literature [9, 12, 34,35,36]. This could be explained by the fact that our study was limited to CT and both readers had similar experience. For ancillary features, agreement was high for corona enhancement, substantial for intra-lesional fat and moderate for mosaic architecture. Kappa could not be calculated for fat sparing in solid mass, non-enhancing capsule and blood product in mass because of perfect agreement. For LI-RADS categories, agreement was almost perfect before and after applying ancillary features.

Our study had some limitations. First, it was a retrospective study, so selection bias may affect generalizability. However, selection criteria were defined to respect LI-RADS technical requirements, the referral pattern at our tertiary care center, and the requirement for a reference standard. Second, the exclusion criteria (which were required to exclude patients with a known history of HCC, technically inadequate examinations such as unenhanced CT, and absence of contemporaneous reference standard) decreased the patient sample size. Nevertheless, the per-lesion unit of analysis included 104 observations, a sample size similar to that of prior diagnostic CT studies [23, 25,26,27, 37]. Third, the application of ancillary features to adjust the observation category was left at the discretion of radiologists, as stated in the LI-RADS application rules for ancillary features. Hence, decisions to adjust observation category were subjective. Future research studying the thought process leading to category adjustment is needed. Finally, the reference standard for the final diagnosis of HCC was not based on a histopathological diagnosis alone but on a composite reference standard. Restricting our eligibility criteria to patients with pathologically confirmed specimens only would have introduced a confirmation bias. This approach is congruent with current clinical guidelines that do not mandate pathological proof for observations that meet diagnostic imaging criteria.

In conclusion, our CT study confirms that LI-RADS major features have moderate to excellent specificity and LR-5 category has a near-perfect specificity for the diagnosis of HCC. Use of ancillary features on CT at the discretion of the radiologist modifies the final category in a small number of cases, but provides similar sensitivity and specificity for the diagnosis of HCC. Combining the LR-4, LR-5, and LR-TIV categories increases the sensitivity but with a decrease in specificity. Interreader agreement was substantial to almost perfect for major features, moderate to almost perfect for ancillary features, and almost perfect for categories.