Introduction

Nonalcoholic fatty liver disease (NAFLD) is characterized by excess hepatic fat in the absence of significant alcohol consumption and alternate etiologies for hepatic steatosis [1]. It has burgeoned to become the most common liver disease, with an estimated global prevalence of 25% [2]. Nearly 20% of all NAFLD patients suffer from nonalcoholic steatohepatitis (NASH), characterized by hepatocyte injury and an increased risk for liver cirrhosis, hepatocellular carcinoma, liver failure, and cardiovascular death [2]. Although one-third of all Asians are estimated to suffer from NAFLD, it remains under-recognized, and less than 10% of diagnosed cases are referred to a specialist [3,4,5].

Patients with NASH, particularly those with advanced fibrosis (AF) ≥ F3, are at risk of adverse outcomes. Therefore, this sub-population needs referral and management by a specialist. Liver biopsy remains the gold standard for identifying NASH and fibrosis. However, the invasive nature, sampling variability, inter- and intra-observer reporting variability, and associated risks and costs prohibit its widespread use in NAFLD. Non-invasive tests (NITs) for the assessment of fibrosis include biomarker-based NITs like NAFLD fibrosis score (NFS), fibrosis-4 index (FIB4), aspartate aminotransferase (AST)-platelet ratio index (APRI), AST-alanine aminotransferase (ALT) ratio (AAR), enhanced liver fibrosis (ELF), and body mass index-AAR-diabetes mellitus (BARD) score, or mechanical liver stiffness using magnetic resonance-based or ultrasound-based transient elastography (TE) techniques [6]. Biomarker results are influenced by factors including age, comorbidities, the prevalence of fibrosis, and NASH [7]. Moreover, most NITs have been developed and validated predominantly in Caucasians [8]. Reports from Asia and other geographic regions suggest poor diagnostic accuracy of biomarker-based NITs [9,10,11]. Therefore, the present study was designed to evaluate the diagnostic accuracy of NITs for identifying AF in biopsy-proven NAFLD patients. We aimed to identify optimal thresholds for the Asian population and assess the role of two-step screening with a blood-based biomarker followed by TE in resource-limited settings.

Patients and Methods

We invited authors from South Asia with prior publications or experience in managing NAFLD to take part in this study. A total of 9 centers were invited to contribute data, of which 6 agreed to participate. Two centers did not include any patients with advanced fibrosis (Medanta Medicity, Gurgaon, Haryana, India and SCB Hospital, Cuttack, Orissa, India), while another center had not performed TE for their patients (First Affiliated Hospital of Fujian Medical University, Fuzhou, Fujian, China); their details are provided in Fig. 1.

Fig. 1 Patient flow in the study cohort

Finally, we retrospectively analyzed data from three centers, which included four cohorts: three from India and one from Singapore. Each cohort consisted of a prospectively maintained database of consecutive adult patients (age > 18 years) with biopsy-proven NAFLD. The first cohort (All India Institute of Medical Sciences, AIIMS, liver clinic cohort) comprised patients recruited from a gastroenterology out-patient clinic, and the second cohort consisted of patients referred to the surgical department for bariatric surgery at the same center (AIIMS bariatric cohort). The third cohort included patients following up in the hepatology department at the Post Graduate Institute of Medical Education and Research (PGIMER), Chandigarh, India, while the fourth cohort included patients following up at Singapore General Hospital, Singapore. Investigations to exclude other causes of liver disease were performed, which included hepatitis B surface antigen, anti-hepatitis C antibody, anti-nuclear antibody, anti-smooth muscle antibody, anti-liver-kidney-microsomal antibody, and serum ceruloplasmin. Patients who consumed alcohol in amounts beyond the cut-off recommendation for NAFLD (14 units/week for females and 21 units/week for males) were excluded [12]. Additionally, patients with human immunodeficiency virus (HIV) infection, pregnancy, heart failure, ascites, active malignancy, valvular heart disease, or liver transplant were excluded.

Liver Biopsy and Fibrosis

An ultrasound-guided percutaneous plugged liver biopsy was performed by the treating physician or an interventional radiologist at each individual center. In patients undergoing bariatric surgery (AIIMS bariatric cohort), liver biopsy was performed intraoperatively from the left lobe using a 16G Trucut biopsy needle (Bard, Maxcore). The biopsy specimens were fixed using 10% buffered neutral formalin, processed, and paraffin-embedded. Hematoxylin–eosin staining along with Masson’s trichrome was used to stain sections of 4 microns thickness, which were evaluated by expert pathologists at each center blinded to clinical and laboratory parameters. Samples were considered adequate if at least 6 portal triads could be identified. The NASH Clinical Research Network (CRN) scoring system was used to assess the stage of fibrosis, and a NAFLD activity score (NAS) > 5 points was used to define NASH [13]. Fibrosis seen in NAFLD patients was graded from F0 to F4. F0 represents the absence of fibrosis, F1 includes periportal or perisinusoidal fibrosis, F2 perisinusoidal and portal/periportal fibrosis, F3 bridging fibrosis, and F4 cirrhosis [13]. AF includes the F3 and F4 stages of fibrosis and is the key prognostic indicator for liver-related events, overall mortality, and occurrence of HCC in NAFLD [14].

Non-invasive Tests

All patients underwent TE and blood investigations. TE was performed after overnight fasting using FibroScan touch 502 (Echosens, France) at each center by trained operators blinded to clinical and histological details. TE provides two outputs: liver stiffness measurement (LSM), which represents tissue stiffness and fibrosis, and the controlled attenuation parameter (CAP), which represents steatosis. Measurements were taken from the right lobe of the liver using a window in the intercostal space with the patient in the supine position and the right arm abducted. The “M-probe” was used for those with BMI < 30 kg/m2 and the “XL-probe” for BMI ≥ 30 kg/m2 [15]. Ten valid measurements were taken for LSM and CAP, and values were reported as median with interquartile range (IQR), with measurements considered reliable if IQR/median was < 0.3.
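The reliability criterion described above can be illustrated with a minimal Python sketch. This is not the FibroScan software's own computation: the function name `summarize_lsm`, the quartile convention (Python's "inclusive" method), and the minimum-count check are assumptions made for illustration only.

```python
from statistics import median, quantiles


def summarize_lsm(readings_kpa, max_iqr_ratio=0.3):
    """Summarize valid LSM readings (kPa) as median, IQR and a reliability flag.

    Hypothetical helper: the quartile method is an assumption and may differ
    from the convention used by the device.
    """
    if len(readings_kpa) < 10:
        raise ValueError("expected at least 10 valid measurements")
    med = median(readings_kpa)
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(readings_kpa, n=4, method="inclusive")
    iqr = q3 - q1
    # Reliable when the spread is small relative to the median (IQR/median < 0.3).
    return {"median": med, "iqr": iqr, "reliable": iqr / med < max_iqr_ratio}
```

For example, ten tightly clustered readings around 7 kPa would be flagged as reliable, whereas readings spread between 5 and 20 kPa would not.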

Blood investigations included hemogram, liver, and renal biochemical tests, lipid profile, fasting blood glucose, and insulin levels. These were used to calculate the five established blood-based NITs: NFS [9], FIB4 [16], BARD score [17], APRI [18], and AAR (Supplementary Table 1). Agile 3+ score is a recently validated score for diagnosis of AF in NAFLD that utilizes a combination of TE and blood tests, and was included in the primary analysis. A similar score (FAST score) has been used to diagnose fibrotic NASH (i.e., NAS score ≥ 4 and significant fibrosis ≥ F2)—a concept distinct from AF. All blood investigations were performed within 4 weeks of TE.
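For orientation, the blood-based scores can be sketched in Python using the formulas as they are commonly published. Supplementary Table 1 is not reproduced here, so the coefficients and scoring rules below are assumptions that should be checked against it; the function and parameter names are hypothetical.

```python
from math import sqrt


def fib4(age_years, ast_u_l, alt_u_l, platelets_10e9_l):
    """FIB-4 index: (age x AST) / (platelets x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_10e9_l * sqrt(alt_u_l))


def apri(ast_u_l, ast_upper_limit_u_l, platelets_10e9_l):
    """AST-platelet ratio index; the AST upper limit of normal is lab-specific."""
    return (ast_u_l / ast_upper_limit_u_l) * 100 / platelets_10e9_l


def aar(ast_u_l, alt_u_l):
    """AST-ALT ratio."""
    return ast_u_l / alt_u_l


def nfs(age_years, bmi, ifg_or_diabetes, ast_u_l, alt_u_l,
        platelets_10e9_l, albumin_g_dl):
    """NAFLD fibrosis score (commonly published coefficients, assumed here)."""
    return (-1.675
            + 0.037 * age_years
            + 0.094 * bmi
            + 1.13 * (1 if ifg_or_diabetes else 0)
            + 0.99 * aar(ast_u_l, alt_u_l)
            - 0.013 * platelets_10e9_l
            - 0.66 * albumin_g_dl)


def bard(bmi, ast_u_l, alt_u_l, diabetes):
    """BARD score: BMI >= 28 (1 point), AAR >= 0.8 (2 points), diabetes (1 point)."""
    return ((1 if bmi >= 28 else 0)
            + (2 if aar(ast_u_l, alt_u_l) >= 0.8 else 0)
            + (1 if diabetes else 0))
```

As a hypothetical example, a 50-year-old with AST 40 U/L, ALT 50 U/L, and platelets 200 × 10^9/L has a FIB4 of about 1.41, which lies between the study-derived cut-offs of 0.66 and 2.45.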

Statistical Analysis

We limited the analysis to those patients from the cohorts who had data available for estimating each of the five blood-based tests as well as LSM by TE. Qualitative data were presented as proportions and compared using the Chi-square test. Quantitative data were assessed for normality using the Shapiro–Wilk test and presented as mean (standard deviation, SD) or median (interquartile range, IQR) as appropriate. Continuous data were compared using Student’s t-test for the mean (SD) or the Mann–Whitney U test for the median (IQR). The diagnostic accuracy of each NIT was assessed using the area under the receiver operating characteristic curve (AUROC). Two diagnostic cut-offs, corresponding to the 90% sensitivity and 90% specificity thresholds for AF, were calculated to form the rule-out and rule-in criteria for each test, respectively. We also estimated sensitivity and specificity at thresholds recommended by professional hepatology societies. Univariate analysis and multivariate logistic regression were performed to assess factors predicting AF. AUROCs for the available NITs were compared using the DeLong method. To derive the combination of two tests in series with the best accuracy, we calculated the proportions of patients who were correctly classified (true positives [TP] plus true negatives [TN]), incorrectly classified (false negatives [FN] plus false positives [FP]), or indeterminate among total cases. A p value of < 0.05 was considered significant. Analysis was performed using Stata v14 [Stata Corp, Texas].
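The AUROC and the dual cut-off idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Stata analysis used in the study: the function names are hypothetical, a score at or above the cut-off is treated as test-positive, candidate cut-offs are limited to observed values, and the DeLong comparison is not implemented.

```python
def auroc(scores, labels):
    """Probability that a random AF case scores higher than a random non-AF
    case, with ties counted as 0.5 (equivalent to the empirical AUROC)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def dual_cutoffs(scores, labels, target=0.90):
    """Return (rule_out, rule_in) cut-offs.

    rule_out: highest observed value that keeps sensitivity >= target
              (scores below it rule out AF).
    rule_in:  lowest observed value that reaches specificity >= target
              (scores at or above it rule in AF); None if never reached.
    """
    candidates = sorted(set(scores))
    rule_out = max(c for c in candidates
                   if sens_spec(scores, labels, c)[0] >= target)
    rule_in = min((c for c in candidates
                   if sens_spec(scores, labels, c)[1] >= target),
                  default=None)
    return rule_out, rule_in
```

Scores falling between the two cut-offs form the indeterminate zone referred to throughout the Results.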

Results

A total of 641 patients were available for analysis (323 patients from two cohorts from New Delhi, India, 178 patients from Chandigarh, India, and 140 patients from Singapore) (Fig. 1). Baseline clinical, comorbidity, and laboratory parameters along with NIT measurements, stratified by center, are shown in Table 1. The study population comprised 55.2% males, with a mean (± SD) age of 42.9 ± 12.8 years and a mean BMI of 30.5 ± 8.55 kg/m2. As expected, the AIIMS bariatric cohort had a significantly greater median BMI than the other cohorts (42.7 (37.9–47.9) kg/m2 vs 26.6 (24.2–29.8) kg/m2, p < 0.001), while the Singapore cohort included older patients (54.2 ± 11.1 vs 39.8 ± 10.5 years, p < 0.001). On liver biopsy, the proportions of patients with fibrosis grades F0, F1, F2, F3, and F4 were 30%, 31.4%, 13.9%, 15.4%, and 9.4%, respectively; thus, there were 159 patients (24.8%) with AF (≥ F3 fibrosis). Histological evidence of NASH was present in 28.4% of patients, with a higher prevalence among patients with AF compared to F0-F2 fibrosis (53.3% vs 15.7%, p < 0.001).

Table 1 Clinical, laboratory parameters, and non-invasive tests (transient elastography, biomarker based, and combination tests) across all cohorts (N = 641)

Diagnostic Accuracy of Existing NITs in Identifying Fibrosis

We measured the AUROC for all tests in identifying AF in each cohort (Table 2). Overall, the diagnostic performance of LSM was superior to all blood biomarker-based NITs for the detection of AF (p < 0.001, DeLong test), and most biomarker-based tests had poor diagnostic accuracy with an AUROC < 0.7 (Fig. 2). Diagnostic accuracy was high (> 0.8) only for LSM and the Agile 3+ score, a combination score that utilizes LSM values. On pairwise comparison (Supplementary Table 2), LSM was superior to the FAST score (p = 0.011, DeLong test) but had similar diagnostic accuracy to Agile 3+. Among the biomarker-based NITs, NFS and FIB4 had the highest AUROC (0.69 and 0.67), but these were not significantly different (p = 0.461, DeLong test). We performed univariate and multivariate analysis to identify factors associated with AF in this population and found age, diabetes, BMI ≥ 25 kg/m2, AST, and platelet counts to be significant (Supplementary Table 3).

Table 2 Area under ROC curve (AUROC) for non-invasive tests for detecting advanced fibrosis (AF) in all cohorts (N = 641)

Fig. 2 Area under ROC curve (AUROC) of various non-invasive tests for detecting advanced fibrosis (AF). LSM liver stiffness measurement, NFS NAFLD fibrosis score, APRI AST-platelet ratio index, AAR AST-ALT ratio

Diagnostic Accuracy of NITs for Various BMI Categories

We compared the AUROC of each NIT for the diagnosis of AF in BMI groups both below and above thresholds of 25, 30, 35, and 40 kg/m2 (Table 3). The most accurate NITs, LSM and Agile 3+, remained accurate across these BMI categories.

Table 3 Area under ROC curve (AUROC) for non-invasive tests for detecting advanced fibrosis (AF) in various BMI strata (N = 641)

Existing NITs Underperform at Recommended Thresholds

We reviewed existing major society guidelines [19,20,21] and compared the sensitivity and specificity in our study population with those described in other cohorts. We found that the recommended FIB4 and NFS thresholds for ruling out AF showed low sensitivity in our cohort (66% and 65%, respectively). The 90% sensitivity and 90% specificity cut-offs in the study population were 7.3 and 12.7 kPa for LSM, 0.66 and 2.45 for FIB4, and − 2.97 and 0.29 for NFS, respectively (Table 4).

Table 4 Area under ROC curve (AUROC), sensitivity, specificity of various non-invasive tests for advanced fibrosis (except for FAST and Agile 4) as per cut-offs described in the literature (N = 641)

Two-Step Approach Combining Biomarker-Based NIT and LSM with TE

We assessed the diagnostic accuracy for identifying AF using a combination of two tests that included a biomarker-based NIT followed by TE (Table 5). As a single test, LSM performed the best, with 55.4% of patients correctly classified for AF, 10.1% incorrectly classified, and 34.5% unclassified. The biomarker-based NITs (FIB4, NFS) left a large proportion of patients (64–68%) in the unclassified category. The combination score Agile 3+ did not improve diagnostic classification compared to LSM alone with respect to the proportion of correctly or incorrectly classified patients. We applied a two-step diagnostic model of a biomarker-based NIT followed by LSM. The first approach utilized both the rule-out and rule-in criteria of FIB4 or NFS, followed by LSM only in patients not classified by either FIB4 or NFS. However, the rule-in criteria of FIB4 or NFS increased the number of false positives (from 10.1% incorrectly classified with LSM alone to 16.5%). In a second algorithm, the 90% sensitivity threshold for FIB4 or NFS was used only to rule out AF, followed by TE; this approach yielded a similar proportion of correctly classified patients with fewer incorrectly classified. The numerically highest percentage of correctly classified cases (62.4%), as well as the lowest percentage of indeterminate cases (27%), was found with a sequence of NFS (rule-out criteria only) followed by LSM (i.e., NFS < − 2.97 to rule out, and subsequently applying LSM criteria to all cases with NFS ≥ − 2.97) (Fig. 3). One-fourth (155, 24.8%) of patients had an NFS of < − 2.97 and thus may avoid referral for TE as per this algorithm. Similarly, for FIB4 (rule out at < 0.66) followed by LSM criteria as above, correctly classified and indeterminate cases were 60.5% and 28.4%, respectively.
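The NFS (rule-out) followed by LSM sequence can be written as a short Python sketch. The cut-offs are those reported above; the function names, the returned labels, and the handling of values falling exactly on a cut-off are assumptions made for illustration, and the patients used in any example are hypothetical.

```python
NFS_RULE_OUT = -2.97  # 90% sensitivity cut-off for NFS reported in this study
LSM_RULE_OUT = 7.3    # kPa, 90% sensitivity cut-off for LSM
LSM_RULE_IN = 12.7    # kPa, 90% specificity cut-off for LSM


def two_step(nfs_score, lsm_kpa=None):
    """Classify one patient with NFS (rule-out only) followed by LSM."""
    if nfs_score < NFS_RULE_OUT:
        return "AF ruled out"      # no TE needed under this algorithm
    if lsm_kpa is None:
        return "refer for TE"      # NFS could not rule out AF
    if lsm_kpa < LSM_RULE_OUT:
        return "AF ruled out"
    if lsm_kpa >= LSM_RULE_IN:     # boundary handling is an assumption
        return "AF ruled in"
    return "indeterminate"


def tally(patients):
    """Proportions correctly classified, incorrectly classified and
    indeterminate for (nfs_score, lsm_kpa, has_af) tuples."""
    counts = {"correct": 0, "incorrect": 0, "indeterminate": 0}
    for nfs_score, lsm_kpa, has_af in patients:
        call = two_step(nfs_score, lsm_kpa)
        if call in ("indeterminate", "refer for TE"):
            counts["indeterminate"] += 1
        elif (call == "AF ruled in") == bool(has_af):
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    total = len(patients)
    return {key: value / total for key, value in counts.items()}
```

Applying `tally` to a cohort with biopsy results gives the three proportions reported in Table 5 for a given sequence; swapping the first-step score and cut-off (for example FIB4 < 0.66) follows the same pattern.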

Table 5 Diagnostic accuracy of individual non-invasive tests and two-step including biomarker-based test followed by liver stiffness measurement

Fig. 3 Algorithm for the non-invasive assessment of advanced fibrosis in patients with NAFLD, using NFS rule-out criteria followed by liver stiffness measurement (LSM)

At the currently recommended FIB4 thresholds of < 1.3 (rule out) and ≥ 2.67 (rule in), for example, classification yielded TP: 37, TN: 296, FP: 39, and FN: 57, with corresponding correctly classified, indeterminate, and incorrectly classified proportions of 52%, 15%, and 33%, respectively. Similarly, the recommended NFS criteria of < − 1.45 (rule out) and 0.67 (rule in) yielded TP: 23, TN: 320, FP: 33, and FN: 56, with corresponding correctly classified, indeterminate, and incorrectly classified proportions of 55%, 14%, and 31%, respectively.

The clinical characteristics as well as measures of NITs of the 3 cohorts which could not be included due to lack of test-positive (no patients with advanced fibrosis) or non-availability of TE data are shown in Supplementary Table 4. Available biomarker-based NITs in the cohort from China (n = 67) demonstrated poor diagnostic accuracy (Supplementary Table 5).

Discussion

Advanced fibrosis is an important prognostic factor in NAFLD, correlating with liver-related outcomes and mortality. Considering the invasive nature of liver biopsy and the increasing global prevalence of NAFLD, NITs have become widely used to predict AF and refer patients to specialists. Despite abundant information from the West suggesting sufficient diagnostic accuracy of existing NITs, validation from the East is sparse. In an analysis of four cohorts from Asia, we found poor diagnostic accuracy of currently used biomarker-based NITs (AUROC < 0.7) and low sensitivity at the cut-offs recommended by existing society guidelines. LSM measured by TE and the Agile 3+ score were the most accurate diagnostic tests for identifying AF. A two-step model using a sequence of either FIB4 or NFS (rule-out criteria only) followed by LSM correctly classified nearly two-thirds of patients for the presence of AF.

Factors Affecting Diagnostic Accuracy of Biomarker-Based NITs

All currently used biomarker-based NITs had poor diagnostic accuracy in the four cohorts from 3 centers (two from India and one from Singapore) in the main analysis, as well as in the Chinese cohort. The performance of these tests depends on several factors. First, NITs have been shown to have poor accuracy in Asian populations. Even within a particular region, these tests were less accurate among patients of South Asian descent [30]. A post hoc analysis found similar diagnostic accuracy for FIB4 and NFS between White and Asian populations (AUROC between 0.73 and 0.80) but did not analyze South Asians separately [28]. Second, obesity and BMI influence NITs, as shown by a study of 584 patients with a mean BMI of 45 kg/m2 that reported AUROCs for NFS, FIB4, and APRI below 0.62 [29]. Our results suggest worse performance of certain NITs (NFS and APRI) among patients in higher BMI categories (Table 3). Third, increasing age is associated with an increase in liver stiffness, which further confounds the expected presence of NASH-related fibrosis, and a higher FIB4 threshold has been proposed for those over 65 years of age [35]. Fourth, the NITs are calculated from parameters (AST, ALT, and platelet counts) with significant intra-individual variation and thus measurement error [36]. Lastly, population prevalence and pre-test probability strongly affect the positive and negative predictive values (PPV, NPV) of a diagnostic test. This is particularly true for AF in NASH, where most tests show a high NPV and a low PPV [37, 38]. In a population-based study from Hong Kong among 922 individuals with a 2% prevalence of AF, a negative FIB4 or NFS had a high NPV (98%) for AF, but sensitivity was low (7–13%) [39].
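The dependence of PPV and NPV on prevalence follows directly from Bayes' rule and can be illustrated with a minimal Python sketch. The function name and the input values used as examples are hypothetical and are not results from this study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    # Expected fractions of the whole population in each cell of the 2x2 table.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp)  # share of positive results that are true positives
    npv = tn / (tn + fn)  # share of negative results that are true negatives
    return ppv, npv
```

For a hypothetical test with 90% sensitivity and 90% specificity, the PPV is about 75% at a prevalence of 25% (close to the proportion of AF in a biopsy cohort such as ours) but only about 16% at a prevalence of 2% (as in a population-based setting), while the NPV rises from roughly 96% to above 99%.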

Choice Among NITs

LSM has the highest AUROC in patients from Asia (0.91) as well as Europe and America (0.89) and is relatively unaffected by BMI, similar to the findings of our study [40]. This BMI independence of LSM is contingent on using appropriate XL probes in patients with BMI ≥ 30 kg/m2 [41]. Our results support that LSM is more accurate than all biomarker-based NITs in detecting AF, but the cost and limited availability of TE may preclude its universal implementation among all NAFLD patients. The second-best choices among NITs were FIB4 and NFS, albeit with low accuracy (AUROC < 0.7). Another panel, the easy liver fibrosis test (eLIFT), has been developed for ‘bedside use’ based on simple parameters (age, gender, AST, GGT, platelet count, and prothrombin time) for population screening [42]. However, it contains multiple parameters and is thus difficult to use at the bedside. Its accuracy was similar to that of FIB4 (AUROC 0.78 vs 0.79, p = 0.42). GGT was not measured in our study, and thus eLIFT could not be assessed. Recommended FIB4 cut-offs have poor sensitivity and specificity in Asians, and our data suggest 90% sensitivity and specificity cut-offs of 0.66 and 2.45, respectively. Using the existing FIB4 cut-offs defined in the literature leads to a low positive predictive value (10.5%) in Asian cohorts, and the PPV remains low (25.3%) even with modified cut-offs. Importantly, there was a high proportion of incorrectly classified patients (> 30%) for AF when existing guideline criteria for FIB4 and NFS were adopted for Asian patients.

Two-Step Diagnostic Algorithm

Each diagnostic test that utilizes separate rule-out and rule-in cut-offs creates a gray zone of unclassified individuals in whom the diagnosis remains indeterminate. All biomarker-based NITs leave an unacceptably high proportion of patients (> 50%) who cannot be classified using the rule-in or rule-out criteria alone. A combination of 2 NITs can help reduce the unclassified proportion, that is, the proportion of cases not allotted a diagnostic label [43, 44]. Two-step application of FIB4 followed by ELF™ improves the detection of AF five-fold and reduces referrals by 80% [26, 44]. Boursier et al. compared sequential application of NITs and found that an algorithm with TE or FIB4 as the first test followed by FibroMeterVCTE carried high diagnostic accuracy, correctly classifying 90% of their population for the presence of AF [43, 45]. Unlike the present study, their population was derived from French centers and was most likely of predominantly Caucasian ethnicity. We did not assess ELF, FibroMeter, and FibroTest, which require proprietary fibrosis markers beyond routine blood tests. The GO ASIA working group has also shown the utility of a two-step approach of NFS or FIB4 followed by LSM in an Asian setting [46].

In our population, we found the highest diagnostic accuracy for identifying AF with a sequence of NFS followed by TE. The NFS (rule-out)-LSM sequence correctly classified 62.4% of the population, while 27% remained unclassified. The advantage of this two-step algorithm over LSM alone is a reduction of unclassified cases from 34.5 to 27%. Thus, an NFS of ≥ − 2.97 can be used as the basis for referral for TE in resource-limited settings, which avoids referral for the approximately 25% of patients with NFS below this threshold. Because NFS and other biomarker-based NITs have low specificity for ruling in AF, only their rule-out criteria are used. An algorithm of FIB4 followed by shear-wave elastography (SWE) has been reported to be cost-effective, with cost-per-individual and effectiveness of $103.93 and 0.92, followed by FIB4-LSM ($135.84, 0.88), outperforming LSM ($226.95, 0.84), FIB4 ($296.2, 0.76), and NFS ($541.1, 0.67) [47].

How to Approach an ‘Unclassified’ Reading?

An indeterminate result, such as an NFS above − 2.97 along with an LSM between 7.3 and 12.7 kPa, was found in 27% of patients using the NFS (rule-out) followed by LSM algorithm. Such test results pose difficulty to the physician while planning further evaluation and management. A few options exist. First, a liver biopsy may be recommended to confirm or refute AF. In real life, less than a quarter of gastroenterologists perform a liver biopsy in patients with NAFLD, and more than half initiate therapy with vitamin E without biopsy confirmation of NASH [48]. Liver biopsy rates in clinical practice outside referral centers are likely to be even lower. Second, the physician may follow up after a period of time, given the slow progression of fibrosis in these patients. An increase in test parameters to the point of meeting rule-in criteria may clarify the treatment plan. Alternatively, an increase in test values over time can itself indicate an increased risk of adverse liver events. For example, an increase of 1 point in FIB4 or a transition from a low-risk to a high-risk value was associated with an increased risk of liver events, with hazard ratios of 1.81 and 7.99, respectively [49]. A third option is to consider that the patient has AF and initiate lifestyle changes for therapeutic benefit, since this will likely reduce cardiovascular risk and prevent the development of type-2 diabetes mellitus, in addition to reducing liver-related events.

Combination Tests

FibroScan-based tests such as Agile 3+, Agile 4, and the FAST score have been used to accurately diagnose AF (≥ F3 fibrosis), cirrhosis (F4 fibrosis), and fibrotic NASH, respectively [22, 23]. They help to reduce the number of patients in the unclassified category. Sanyal et al. found that Agile 3+ improved the AUROC for diagnosis of AF to 0.90, which was superior to LSM (0.86, p < 0.001), and reduced the proportion of indeterminate patients from 23 to 13% [22]. However, we found LSM and Agile 3+ to have similar diagnostic accuracy (0.82 each, p = 0.438) as well as a similar proportion of unclassified patients (34% and 39%, respectively). Thus, in our Asian cohort, Agile 3+, an equation derived and validated in a predominantly Caucasian population, did not yield incremental benefit over using LSM alone.

Strengths and Limitations

Our study analyzed a large sample from 4 different cohorts to determine the diagnostic accuracy of NITs in detecting the presence of AF in biopsy-proven NAFLD. However, it had a few limitations. First, ours was a retrospective study that may have been prone to selection bias in choosing patients for liver biopsy. Second, liver biopsy is an imperfect gold standard, with inter-observer and intra-observer variability, particularly because different pathologists at each center interpreted the fibrosis stage. Third, we did not assess NITs that depend on proprietary tests, since they have limited availability in resource-limited settings.

In conclusion, our study found that the established biomarker-based non-invasive tests perform poorly in Asian settings. LSM is the single most accurate non-invasive test to detect the presence of advanced fibrosis in NAFLD patients. Combination tests of LSM and blood investigations such as the Agile 3+ and FAST scores have high AUROC but do not provide any additional diagnostic accuracy compared to LSM alone. Sequential application of NFS (rule-out)-LSM can correctly classify nearly two-thirds of cases for advanced fibrosis. This reduces the proportion of unclassified cases and potentially avoids the cost of TE in one-fourth of cases, in whom NFS can rule out AF with 90% sensitivity.