Introduction

Hepatic steatosis (HS), the most prevalent liver disease worldwide [1], is a major factor in selecting appropriate donors for living donor liver transplantation [2, 3]. HS increases the risk of graft failure; while macrovesicular steatosis (MaS) greater than 10% is a concern, MaS greater than 30% is an absolute contraindication for donation [4]. In our institution, donors with a MaS of < 10–15% and total steatosis (TS, sum of macro- and microvesicular steatosis) of < 30% are preferred for right hemiliver donation, but donors with moderate HS (30–50%) are allowed if they meet all of the following conditions: (1) age ≤ 35 years, (2) estimated ratio of the remnant left hemiliver to total liver volume ≥ 35%, (3) degree of the HS of the left hemiliver less than that of the right hemiliver, (4) sufficient graft-to-recipient weight ratio in the recipient, and (5) recipients unable to wait for weight reduction of the living donor candidates because of medical conditions requiring urgent liver transplantation. Living donor candidates not fulfilling the above criteria are reevaluated after weight reduction. Therefore, in potential liver donors with HS, repetitive assessment of liver fat content is required to monitor the degree of HS during the course of weight reduction.

Although liver biopsy is the reference standard for HS assessment, its suitability for repetitive examinations is limited because of its invasiveness, low reproducibility, and high cost [5, 6]. Also, as the distribution of fat within the liver is spatially heterogeneous [7,8,9], HS assessment using liver biopsy may be erroneous as only small pieces of hepatic tissue are obtained. Recent MRI techniques such as proton density fat fraction (PDFF) are considered accurate for assessing HS [10], but the high cost and limited accessibility of MRI may hamper its widespread use.

CT is widely used for the preoperative evaluation of potential liver donors. Using CT, intrahepatic fat can be quantitatively evaluated by placing a region-of-interest (ROI) and measuring the attenuation values (Hounsfield units; HU) of the liver relative to an internal reference, most commonly the spleen. Although there has been controversy over the performance of CT for HS assessment, especially for mild HS [11], recent studies have shown that quantitative CT analysis can accurately assess mild HS and show good agreement with MRI PDFF and pathology [12, 13]. However, such a quantitative method is time-consuming and operator-dependent. Special care is required to reduce measurement error because the sampled attenuation values may be contaminated by environmental density artifact, partial volume averaging, and other factors that affect the attenuation of the target organs.

The degree of HS can be comprehensively assessed by visually comparing the attenuation of the liver and hepatic vessels, which is also non-invasive and simple to perform. The purpose of this study was to validate the reliability and reproducibility of a visual grading system (VGS) for diagnosing HS in a large cohort of living liver donor candidates.

Materials and methods

Study population

Living liver donor candidates who underwent abdominal CT and liver biopsy for pre-donation workup between January 2013 and December 2014 were consecutively identified. Inclusion criteria were: (a) subjects with abdominal CT scans; (b) at least two biopsy specimens; and (c) less than 3 months interval between CT and liver biopsy. Among 758 subjects initially recruited, 41 were excluded (24 with biopsy–CT interval of > 3 months; 17 without unenhanced CT images). Subjects who fulfilled the inclusion criteria were eligible, regardless of the degree of their HS. The remaining 717 subjects [480 men; median age, 28 years; interquartile range (IQR), 22–35] were finally included. This retrospective study was approved by the institutional review board of our institution and the requirement for written informed consent was waived.

CT protocol

CT scans were performed using a 64 multidetector scanner (Definition, Siemens, Erlangen, Germany). Unenhanced axial images were obtained during a single breath-hold followed by contrast-enhanced images with intravenous administration of 150 mL of iopromide (Ultravist 370; Bayer Schering Pharma, Berlin, Germany). The parameters for the unenhanced CT scans were beam collimation of 64 × 0.6 mm; spiral pitch of 1; gantry rotation time of 0.5 s; tube voltage of 100 kVp; and tube current of 120–200 mAs with automatic exposure control (Care Dose 4D; Siemens). Images were reconstructed with a section thickness of 5 mm at 5 mm intervals. The median interval between CT and biopsy was 9 days (IQR 5–17), and the interval was < 1 month in 90.4% of subjects (648/717).

Hepatic attenuation assessment

The VGS for HS assessment was validated by six reviewers, including three more-experienced radiologists (with 20, 13, and 10 years of experience in liver imaging, respectively) and three less-experienced radiologists (two fellows with 1–2 years of training experience in liver imaging and a 3rd-year radiology resident). All reviewers were blinded to the subjects’ medical histories, previous imaging, and pathologic reports.

A four-point VGS based on comparisons of the brightness of the liver with those of hepatic vessels on unenhanced CT images was developed as follows: grade 0 (G0; hepatic vessels showing lower attenuation than the hepatic parenchyma, with no or minimal margin blurring in less than one-third of the liver), grade 1 (G1; hepatic vessels showing lower attenuation than hepatic parenchyma but with margin blurring in more than one-third of the liver), grade 2 (G2; hepatic vessels showing the same attenuation as hepatic parenchyma), and grade 3 (G3; hepatic vessels showing higher attenuation than hepatic parenchyma; Fig. 1).

Fig. 1

Examples of each grade of the VGS for the assessment of hepatic steatosis. a G0, hepatic vessels showing lower attenuation than the hepatic parenchyma, with no or minimal margin blurring in less than one-third of the liver; b G1, hepatic vessels showing lower attenuation than the hepatic parenchyma, but with margin blurring in more than one-third of the liver; c G2, hepatic vessels showing the same attenuation as hepatic parenchyma; and d G3, hepatic vessels showing higher attenuation than hepatic parenchyma. VGS visual grading system

Grading assessments were performed in two sessions. The first session served as a training session and consisted of 20 selected cases: six with no pathologic HS (TS < 5%); five with mild pathologic HS (TS 5–30%); five with moderate pathologic HS (TS 30–60%); and four with severe pathologic HS (TS ≥ 60%). These data were prepared by one author not involved in the imaging assessment. Through discussion, a consensus for each case was reached among the six reviewers. After a 2-week washout period, the main testing session was performed using images from all 717 subjects.

Two quantitative indices of liver attenuation were obtained from the unenhanced CT: CTL-S (mean hepatic attenuation minus mean splenic attenuation) and CTL/S (mean hepatic attenuation divided by mean splenic attenuation). Detailed information is provided in the Supplemental Materials.
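The two indices defined above can be illustrated with a minimal Python sketch. The function name `ct_indices` and the ROI readings are hypothetical and are not taken from this study; the sketch assumes only that each organ's attenuation is summarized as the mean of its ROI samples in HU.

```python
from statistics import mean


def ct_indices(liver_hu, spleen_hu):
    """Return (CT L-S, CT L/S) from ROI attenuation samples in HU.

    CT L-S is the mean hepatic attenuation minus the mean splenic
    attenuation; CT L/S is the ratio of the same two means.
    """
    liver_mean = mean(liver_hu)
    spleen_mean = mean(spleen_hu)
    if spleen_mean == 0:
        raise ValueError("mean splenic attenuation is zero; ratio undefined")
    return liver_mean - spleen_mean, liver_mean / spleen_mean


# Hypothetical ROI readings (illustrative only, not study data).
difference, ratio = ct_indices([60.0, 62.0, 58.0], [50.0, 49.0, 51.0])
print(difference, ratio)
```

Lower values of either index correspond to a liver that is darker relative to the spleen, which is the direction expected with increasing fat content.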

Liver biopsy and histologic examination

All subjects underwent ultrasound-guided percutaneous liver biopsy by board-certified radiologists (all with > 4 years of experience). Liver tissue was obtained using an 18-gauge needle (TSK Stericut 18G coaxial; TSK Laboratory, Tochigi, Japan). Using the intercostal approach, two or three biopsy specimens, each approximately 1.5 cm in length, were obtained from two different sites in the right hepatic lobe. Specimens were stained with hematoxylin–eosin and Masson trichrome and were examined by board-certified pathologists with > 7 years of experience. TS was evaluated according to the percentage of hepatocytes containing fat granules. The degree of MaS was determined according to the percentage of hepatocytes in which a single large fat droplet or smaller fat droplets occupied the cytoplasm, pushing the nucleus to the periphery.

Statistical analysis

Correlations between imaging-assessed HS (VGS, CTL-S, and CTL/S) and pathologic HS were evaluated using Spearman’s rank correlation coefficient or Kendall’s rank correlation coefficient, as appropriate. The diagnostic performance of each method to detect clinically significant HS (i.e., TS ≥ 30%, MaS ≥ 30%, and MaS ≥ 10%) was evaluated by using receiver operating characteristic (ROC) curve analysis with pathology as the reference. Multi-reader multi-case (MRMC) ROC analysis was used to obtain the average of the six reviewers’ areas under the ROC curves (AUROCs) and their 95% confidence intervals (CIs). AUROCs were compared using DeLong’s method [14]. The performance of the VGS for distinguishing each stage of TS (TS < 5%, TS 5–30%, TS 30–60%, and TS ≥ 60%) and MaS (MaS < 10%, MaS 10–30%, and MaS ≥ 30%) was evaluated by using the Obuchowski measure, a multinomial version of ROC analysis adapted for ordinal references [15]. Interobserver agreement among the six reviewers for the VGS was assessed using Fleiss κ statistics. Interobserver agreement was compared between the more- and less-experienced reviewer groups using the z score and standard errors of the differences. Agreements between pairs of reviewers were assessed using weighted κ statistics. The sensitivity and specificity of the VGS for detecting clinically significant HS were calculated for each reviewer. The corresponding pooled estimates for the average reviewer were calculated using generalized estimating equation logistic regression to adjust for clustering of the six-reviewer data of each patient. To obtain the sensitivity and specificity of CTL-S and CTL/S, the optimal cutoffs were determined using Youden’s index [16]. Statistical analyses were performed using R version 4.1.1 (R Foundation for Statistical Computing, Vienna, Austria). Two-sided p values < 0.05 were considered significant.
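As an illustration of cutoff selection by Youden's index (J = sensitivity + specificity − 1), a minimal Python sketch follows. It is not the R code used in this study; the function name `youden_cutoff` and the sample values are hypothetical, and the sketch assumes that a case is called positive when its attenuation index is at or below the candidate cutoff, since lower CTL-S or CTL/S values indicate more fat.

```python
def youden_cutoff(values, has_disease):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    Assumption: a case is test-positive when value <= cutoff.
    Every observed value is tried as a candidate cutoff.
    """
    n_pos = sum(has_disease)
    n_neg = len(has_disease) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one diseased and one non-diseased case")

    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(values)):
        true_pos = sum(1 for v, d in zip(values, has_disease) if d and v <= cutoff)
        true_neg = sum(1 for v, d in zip(values, has_disease) if not d and v > cutoff)
        j = true_pos / n_pos + true_neg / n_neg - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical index values and pathology labels (illustrative only).
values = [12.0, 10.5, 9.0, 8.0, 3.0, 1.0, -2.0, 5.0]
labels = [False, False, False, False, True, True, True, False]
print(youden_cutoff(values, labels))
```

When several cutoffs tie on J, this sketch keeps the lowest one; other tie-breaking rules are equally defensible.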

Results

Study population

The characteristics of the 717 subjects are summarized in Table 1. There were 457 (63.7%) subjects with TS < 5%, 239 (33.3%) with TS 5–30%, 17 (2.4%) with TS 30–60%, and four (0.6%) with TS ≥ 60%. For MaS, 634 subjects (88.4%) had MaS < 10%, 66 (9.2%) had MaS 10–30%, and 17 (2.4%) had MaS ≥ 30%. The median differences in body weight and body mass index between the time of liver biopsy and CT examination were 0.9 kg (IQR, 0.4–1.9 kg) and 0.3 kg/m² (IQR, 0.1–0.6 kg/m²), respectively, in all study subjects, and 2.1 kg (IQR, 0.8–5.2 kg) and 0.7 kg/m² (IQR, 0.3–1.7 kg/m²) in the 69 subjects with a biopsy–CT interval of more than one month.

Table 1 Characteristics of the study population

Correlation between HS assessed by imaging and pathologic HS

The distribution of TS and MaS in each visual grade assigned by each reviewer is shown in Fig. 2. The VGS and pathologic HS (both TS and MaS) showed positive correlations for all six reviewers (p < 0.001; Table E1). For the six reviewers, the mean correlation coefficients between VGS and pathologic HS were 0.543 (range 0.494–0.572; p < 0.001) for TS and 0.6 (range 0.518–0.697; p < 0.001) for MaS. The median values of CTL-S and CTL/S were 9.7 (IQR 5.6–13.5) and 1.2 (IQR 1.1–1.3), respectively. CTL-S and CTL/S showed negative and weaker correlations with pathologic HS than the VGS (mean correlation coefficient for TS, −0.311 for CTL-S and −0.301 for CTL/S; p < 0.001; mean correlation coefficient for MaS, −0.451 for CTL-S and −0.366 for CTL/S; p < 0.001).

Fig. 2

Distribution of the visual grade assessments of each reviewer according to total steatosis and macrovesicular steatosis. Cases graded with G0 had TS < 5% (76.1–82.3%), TS 5–30% (17.5–23.7%), and TS ≥ 30% (0–0.3%); those graded as G1 had TS < 5% (20.6–53.5%), TS 5–30% (46.5–79.4%), and TS ≥ 30% (0–1.8%); those graded as G2 had TS < 5% (1.9–11.6%), TS 5–30% (72.1–98.1%), and TS ≥ 30% (0–18.5%); and those graded as G3 had TS < 5% (28.6%–50%), TS 5–30% (30.0–42.9%), or TS ≥ 30% (20.0–28.6%). Regarding macrovesicular steatosis, those graded as G0 had MaS < 10% (98.6%–99.5%) and MaS 10–30% (0.5–1.4%); those graded as G1 had MaS < 10% (75.0–92.2%), MaS 10–30% (7.8–25%), and MaS ≥ 30% (0–0.9%); those graded as G2 had MaS < 10% (20.4–27.9%), MaS 10–30% (60.5–66.7%), and MaS ≥ 30% (10.9–13%); and those graded as G3 had MaS < 10% (5.9–10.0%), MaS 10–30% (21.4–40%), and MaS ≥ 30% (50.0–71.4%). Nearly all G0 and G1 cases had TS < 30% (99.4–99.8%) and MaS < 10% (94.4–96.9%). The majority of G2 and G3 cases had MaS ≥ 10% (77.8–96.7%). The visual grades and pathologic hepatic steatosis (both total and macrovesicular steatosis) showed a positive correlation for all six reviewers (p < 0.001). MaS macrovesicular steatosis, TS total steatosis

Diagnostic performance of the imaging-based method to detect clinically significant HS

All six reviewers showed AUROCs > 0.9 for diagnosing TS ≥ 30%, MaS ≥ 30%, and MaS ≥ 10% (Table E2). The ROC curves and AUROCs of VGS, CTL-S, and CTL/S for diagnosing TS ≥ 30%, MaS ≥ 30%, and MaS ≥ 10% are shown in Fig. E1 and Table 2. In the MRMC analysis, the AUROCs of VGS were 0.921 (95% CI 0.906–0.937) for TS ≥ 30%, 0.975 (95% CI 0.968–0.983) for MaS ≥ 30%, and 0.921 (95% CI 0.911–0.931) for MaS ≥ 10%. The ranges of the AUROCs of the quantitative methods were 0.887–0.890 for TS ≥ 30%, 0.981–0.982 for MaS ≥ 30%, and 0.896–0.902 for MaS ≥ 10%. No differences were noted between the AUROCs of VGS and the quantitative methods (p ≥ 0.1).

Table 2 AUROCs of imaging-based methods for predicting clinically significant HS

The per-reviewer and average-reviewer sensitivity and specificity are shown in Table E3. VGS G2 was the best cutoff (i.e., G0–1 vs. G2–3) for detecting TS ≥ 30% and MaS ≥ 30%, and VGS G1 was the best cutoff (i.e., G0 vs. G1–3) for detecting MaS ≥ 10%. When using the VGS G2 as the cutoff, the average-reviewer sensitivity and specificity for detecting TS ≥ 30% were 0.80 and 0.94, respectively, and those for detecting MaS ≥ 30% were 0.98 and 0.92, respectively. When using the VGS G1 as the cutoff, the average-reviewer sensitivity and specificity for detecting MaS ≥ 10% were 0.93 and 0.81, respectively. Optimal cutoffs were 4.02 (TS ≥ 30%), 1.33 (MaS ≥ 30%), and 3.33 (MaS ≥ 10%) for CTL-S, and 1.05 (TS ≥ 30%), 0.95 (MaS ≥ 30%), and 1.07 (MaS ≥ 10%) for CTL/S. The average-reviewer VGS showed equivalent sensitivity compared to CTL-S and CTL/S except it showed a higher sensitivity than both quantitative parameters in detecting MaS ≥ 10%. The specificity of the average-reviewer VGS was higher than both quantitative parameters in detecting TS ≥ 30% and was equivalent to both quantitative parameters in detecting MaS ≥ 30% and MaS ≥ 10%. Using G1 as a cutoff, all six reviewers showed very few false-negative results when using the VGS to detect TS ≥ 30% and MaS ≥ 10% (Table E4).
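The dichotomization described above (for example, G0–1 vs. G2–3) can be sketched for a single reviewer as follows. The function name `sens_spec` and the ten example cases are hypothetical; the sketch computes only per-reviewer estimates and does not reproduce the pooled average-reviewer estimates, which in this study were obtained with generalized estimating equations.

```python
def sens_spec(grades, has_disease, cutoff_grade):
    """Sensitivity and specificity when grade >= cutoff_grade is test-positive.

    grades: visual grades 0-3 from one reviewer.
    has_disease: pathology labels (True = clinically significant HS).
    """
    tp = fn = tn = fp = 0
    for grade, diseased in zip(grades, has_disease):
        positive = grade >= cutoff_grade
        if diseased and positive:
            tp += 1
        elif diseased:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical grades and pathology labels (illustrative only).
grades = [0, 0, 1, 1, 2, 2, 3, 0, 1, 2]
labels = [False, False, False, False, True, False, True, False, True, True]
print(sens_spec(grades, labels, cutoff_grade=2))  # G0-1 vs. G2-3
print(sens_spec(grades, labels, cutoff_grade=1))  # G0 vs. G1-3
```

In this toy example, lowering the cutoff from G2 to G1 removes the single false negative at the cost of more false positives, which mirrors the trade-off behind using G1 as a screening threshold.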

All reviewers showed a higher Obuchowski measure for predicting TS (range 0.73–0.78) and MaS (range 0.91–0.93) than the quantitative parameters (0.72 for CTL-S and 0.71 for CTL/S in predicting TS and 0.90 for both CTL-S and CTL/S in predicting MaS) (Table E5). The diagnostic performance of each method for detecting TS > 5% is shown in Table E6.

Interobserver agreement of the VGS

Interobserver agreement, consistency, and discrepancies among the reviewers are summarized in Table 3. All reviewers showed a substantial agreement (κ = 0.61; 95% CI 0.60–0.63) when using the VGS. All measures of interobserver agreement between two paired reviewers were > 0.75, representing substantial or almost perfect agreement (Table E7). Consistent grading from all reviewers was noted in 60.1% (431/717). More-experienced reviewers showed a higher agreement than less-experienced reviewers (κ, 0.82 and 0.68, respectively, p < 0.001; Table 3). The more-experienced reviewers showed consistent grading for 90.2% (647/717), higher than that achieved by the less-experienced reviewers (74.1%, 531/717; p < 0.001). Discrepancies in the more-experienced group occurred mostly between G0 and G1 (82.9%, 58/70). In the less-experienced group, discrepancies also occurred most frequently between G0 and G1 (77.4%, 144/186), followed by G1 and G2 (21.0%, 39/186).
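For readers unfamiliar with the Fleiss κ statistic used for multi-reviewer agreement, a minimal Python sketch of the standard unweighted formula follows. The function name `fleiss_kappa` and the example ratings are hypothetical, the grades are treated as nominal categories, and no confidence interval is computed, so this is an illustration rather than the analysis performed in this study.

```python
def fleiss_kappa(ratings, n_categories):
    """Unweighted Fleiss kappa.

    ratings: one list per subject holding the grade (0..n_categories-1)
    assigned by each rater; every subject must have the same number of raters.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number of ratings")

    # counts[i][c] = number of raters who assigned category c to subject i
    counts = [[row.count(c) for c in range(n_categories)] for row in ratings]

    # Observed agreement: mean proportion of agreeing rater pairs per subject.
    per_subject = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[c] for row in counts) / (n_subjects * n_raters)
        for c in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical example: six raters in full agreement on four subjects.
print(fleiss_kappa([[0] * 6, [1] * 6, [2] * 6, [3] * 6], n_categories=4))
```

Because the unweighted statistic counts a G0 vs. G1 disagreement the same as a G0 vs. G3 disagreement, it can understate agreement on an ordinal scale, which is one reason pairwise agreement in this study was assessed with weighted κ.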

Table 3 Interobserver agreement, consistency, and discrepancy in the visual grading of HS

Pathology of consistent and discrepant visual gradings

The pathologic results of the consistent and discrepant gradings are listed in Table 4. All G0 and G1 cases that were consistently graded by all reviewers had TS < 30% (100%, 391/391) and mostly had MaS < 10% (98.0%, 383/391). The vast majority of cases graded as G2 or G3 by all reviewers had MaS ≥ 10% (92.5%, 37/40).

Table 4 Pathologic results for each consistent and discrepant category of the visual grading system for HS

Discrepancies in visual grading most commonly occurred between the assignments of G0 and G1 (81.5%, 233/286); pathologically, the vast majority of these cases were categorized as TS < 30% (99.6%, 232/233) and MaS < 10% (97.0%, 226/233). In the more-experienced group, all of the discrepant cases between G0 and G1 had TS < 30% (100%, 58/58), and mostly had MaS < 10% (94.8%, 55/58). Similarly, in the less-experienced group, most of the discrepant cases between G0 and G1 were TS < 30% (99.3%, 143/144) and MaS < 10% (97.2%, 140/144).

Discrepant grading between G1 and G2 occurred for 15.7% (45/286) of all discrepant cases among all reviewers, 11.4% (8/70) in the more-experienced group, and 21.0% (39/186) in the less-experienced group. The subjects with discrepancies in grading between G1 and G2 most commonly had TS of 5–30% (82.2% [37/45] among all six reviewers, 87.5% [7/8] in the more-experienced group, and 79.5% [31/39] in the less-experienced group), and 57.8% of them had MaS ≥ 10% (26/45) among all six reviewers, 62.5% (5/8) in the more-experienced group, and 56.4% (22/39) in the less-experienced group.

Discrepant grading between G2 and G3 was rare, occurring in only 2.1% (6/286) of all discrepant cases for all reviewers, 5.7% (4/70) for more-experienced reviewers, and 1.6% (3/186) for less-experienced reviewers; all subjects had TS of 5–30% and the majority of them had MaS of 10–30% (83.3%, 5/6; 75%, 3/4; and 100%, 3/3, respectively). Discrepant grading among G0 vs. G1 vs. G2 (i.e., G0, G1, and G2 assigned by one or more reviewers in the same case) occurred in two cases (0.3%, 2/717) among all six reviewers, with both subjects categorized as TS < 5% and MaS < 10%.

Discussion

In this study, we demonstrated that the four-point VGS based on visual comparison of hepatic and vessel attenuation on unenhanced CT allows reliable and reproducible assessment of HS. The VGS showed similar performance to the quantitative methods (CTL/S and CTL-S) in diagnosing clinically significant HS. When evaluated by six radiologists with varying degrees of experience, the interobserver agreement was substantial, and most discrepancies did not involve misclassification of clinically significant HS. Therefore, our VGS can be a useful tool for non-invasive, simple, and integrative assessment of HS.

HS assessment by measuring attenuation values within ROIs drawn in liver parenchyma is a quantitative method [11, 17, 18], whereas the VGS is a qualitative method. However, the qualitative nature of the VGS does not necessarily indicate that it is less objective than ROI-based CT attenuation measurement. Assessment of the HU value by placing the ROI on images is also operator-dependent. Density measurements in several areas of the liver can be rendered erroneous by image artifacts [19, 20]. Because very dense bony structures are located close to the dome and the posterior section of the right hepatic lobe, beam hardening artifact frequently occurs and results in decreased attenuation of the affected area [21]. In addition, heterogeneous high- and low-density areas are frequently present in the left lateral section of the liver because of the edge gradient effect caused by pronounced differences in density between liver parenchyma and air in the stomach [21]. Also, as with liver biopsy, ROI acquisition is limited by sampling bias. The distribution of fat content within the liver is spatially heterogeneous [7,8,9], which is difficult to assess comprehensively by using ROIs that only allow the assessment of the liver in confined areas. Although the sampling error could be resolved by drawing many ROIs or by including the whole liver as an ROI, this would be laborious and time-consuming. Moreover, there is no consensus regarding the cutoff value of hepatic attenuation for determining the presence and degree of HS, which precludes the use of the attenuation value in clinical practice. The aforementioned problems could be largely resolved by visual inspection of the whole liver. By using the VGS, we could comprehensively evaluate the overall density of the whole hepatic parenchyma and could thus be free from the bias of ROI measurement.

For a diagnostic tool to be successfully implemented in clinical practice, it is essential to thoroughly validate its performance according to its purpose [22]. The tool should demonstrate acceptable reliability and reproducibility in capturing the clinically meaningful problems and should be validated in a sufficiently large number of the target population [23]. We demonstrated the reliability of our VGS in a large cohort (n = 717) by showing positive correlations between visual grades and pathologic HS of the liver (p < 0.001), and high AUROCs for assessing clinically significant HS (> 0.9). The performance of the VGS was similar to those of quantitative assessments, and nearly all cases of G0–G1 had a TS < 30% (99.4–99.8%) and a MaS < 10% (94.4–96.9%).

The main focus of this study was to demonstrate the reliability and reproducibility of the VGS as a simple and comprehensive method for HS assessment, rather than to show that the VGS is superior to the standard quantitative parameters. We consider that both the VGS and the quantitative parameters could be used in a complementary manner to improve the accuracy of CT-based HS assessment (for example, if the decision is difficult with the VGS alone, the reader could measure the HU of liver, or vice versa), and this should be further investigated in future studies.

We demonstrated the reproducibility of the VGS by showing a substantial to almost perfect interobserver agreement among the six reviewers with a wide range of clinical experience (3rd-year resident to an expert in liver imaging with > 20 years of experience). Although the interobserver agreement differed according to the level of experience, over 75% of cases with discrepant grading involved decisions between the two lowest grades, and the vast majority of the subjects with interobserver discrepancy (> 94%) had pathologic TS < 30% and MaS < 10%. Therefore, most interobserver discrepancies do not appear to be related to misclassifications in the assessment of significant HS from a clinical point of view, which again suggests that our VGS has acceptable reproducibility and reliability across clinicians with a wide range of experience.

Several studies have demonstrated that weight loss achieved through diet and exercise can significantly reduce HS in potential living liver donors [24,25,26]. Although liver biopsy is the current reference standard for the evaluation of HS, given its drawbacks and the necessity for a repetitive assessment of liver fat content during weight loss, the feasibility of liver biopsy in potential liver donors with HS is questionable. Using the VGS with G1 as a cutoff (i.e., G0 vs. G1–3), all reviewers showed very few false-negative results in detecting TS ≥ 30% and MaS ≥ 10%. This suggests the potential utility of the VGS as a screening tool to guide decision-making for liver biopsy: selective biopsy may be performed for those with a VGS of G1 or higher, in whom the possibility of clinically significant HS cannot be excluded.

Our study has several limitations. First, the reference standard for HS was liver biopsy, which may be subject to sampling bias. As a recent study has suggested high accuracy of MRI techniques such as PDFF in quantifying HS [10], comparisons between CT and MRI would yield more conclusive results, which we are planning as a subsequent study. Second, the pathologic HS in our study population was skewed towards a milder degree of HS, which reflected the characteristics of the living liver donor candidates encountered in a real-world setting. It should be noted that when applying the VGS in a population with substantial discrepancies in the prevalence and profile of HS compared with our study population, the VGS could yield more false-positive or false-negative results than in our study. In such a situation, liver biopsy should remain the reference standard. Third, the interval between CT and liver biopsy was relatively long (up to three months). However, over 90% of subjects (648/717) had an interval of < 1 month. Lastly, the VGS has limited ability to discriminate between HS and steatohepatitis, and between nonalcoholic fatty liver disease and nonalcoholic steatohepatitis, which are inherent limitations of CT scans.

In conclusion, our VGS developed for the visual assessment of HS on unenhanced CT images showed acceptable reliability and reproducibility for diagnosing clinically significant HS, especially MaS, in a large population of potential liver donors. Our VGS could be useful for the non-invasive assessment of HS in potential living liver donors during pre-donation workup.