Introduction

Nonalcoholic fatty liver disease (NAFLD) is a silent epidemic with over 12 million cases diagnosed annually in the USA alone [1]. Globally, NAFLD is thought to affect one in four individuals and parallels the rise in obesity and metabolic syndrome [2, 3]. NAFLD is a leading cause of end-stage liver disease, resulting in liver-related complications including hepatocellular carcinoma (HCC) and death [4, 5]. NAFLD is now one of the leading causes of liver transplant [6].

NAFLD encompasses both nonalcoholic fatty liver (NAFL) and nonalcoholic steatohepatitis (NASH) [4, 7]. NAFL is defined by the presence of steatosis (≥ 5%) in the absence of inflammation or hepatocellular injury (hepatocyte ballooning on histology), while NASH refers to steatosis (≥ 5%) occurring in the presence of both [8]. NAFL and NASH exist on a spectrum, with steatohepatitis signifying an increased risk of progressive hepatic injury [9]. The natural history of NAFLD is variable with ~ 20% of patients progressing to advanced fibrosis, which is the main predictor of clinical outcomes [10,11,12]. However, most patients remain asymptomatic, and even advanced NAFLD may be easily overlooked in clinical practice [4].

Liver biopsy is the established standard for diagnosis of NASH and fibrosis, but is an invasive procedure with associated risk, along with diagnostic limitations due to sampling error and interpretation. Several blood-based and imaging elastography noninvasive tests (NIT) have been developed to predict advanced NAFLD fibrosis and reduce the need for biopsy [13, 14]. These include simple, indirect serum marker algorithms such as NAFLD fibrosis score (NFS), FIB-4, aspartate aminotransferase (AST)-platelet ratio index (APRI), and BARD (body mass index (BMI), AST/alanine aminotransferase (ALT) ratio, diabetes), which are routinely available and have comparable diagnostic accuracy for advanced fibrosis [15]. A significant limitation of simple NITs is that at least 30% of patients are classified as “indeterminate,” with index scores that fall between the cutoffs to “rule in” or “rule out” advanced fibrosis [16,17,18]. Diagnostic accuracy is further reduced by misclassified cases, and as a result, these simple tests are not yet able to replace biopsy for diagnosis of advanced fibrosis in NAFLD. Other validated proprietary serum marker tests such as FibroTest (BioPredictive, Paris, France; FibroSURE, Labcorp, Burlington, NC), Enhanced Liver Fibrosis (ELF) score (Siemens Healthcare, Erlangen, Germany), or FibroMeterVCTE (Echosens, Paris, France) may increase diagnostic accuracy as compared to simple markers, but are not as readily available [13]. Imaging-based NITs such as vibration-controlled transient elastography (VCTE, Echosens, Paris, France), shear-wave elastography, or magnetic resonance (MR)-based techniques are being increasingly adopted in diagnostic algorithms for advanced fibrosis in NAFLD, but are limited by variable optimal liver stiffness thresholds across different cohorts and are not as readily available or as cost-effective as simple markers for fibrosis assessment [19].

Sequential combinations of blood-based markers and VCTE were initially developed and validated in chronic hepatitis C, to reduce the need for liver biopsy to diagnose significant fibrosis (F2–4) prior to interferon-based therapy [20, 21]. Recently, the sequential use of NITs such as FIB-4 and ELF or VCTE has been evaluated for risk stratification for NAFLD advanced fibrosis and to improve referral pathways from primary care [22, 23]. Sequential elastography with blood-based markers can improve diagnostic accuracy for advanced fibrosis and reduce indeterminates [24, 25]. However, sequential NIT approaches still require validation, optimal combinations and thresholds have not been established, and proprietary tests such as ELF are not yet available for routine use outside of a research setting in North America. The European Association for the Study of the Liver (EASL) Clinical Practice guidelines for NAFLD indicate that although combined NIT approaches may improve accuracy, there is no consensus on thresholds or strategies for use in avoiding biopsies for advanced fibrosis [26].

Our aims were to identify if stepwise combination of simple NITs (1) reduces indeterminate rates for predicting advanced NAFLD fibrosis (F3–4) compared to individual NIT, (2) identifies the optimal sequential combination of simple NITs to reduce the need for biopsy for the diagnosis of advanced fibrosis (F3–4), and (3) validates the performance of sequential NITs in an independent external cohort.

Methods

Study Design and Population

This was a retrospective observational cohort study of patients with NAFLD from two Canadian tertiary-care centers: University Health Network, Toronto (Training cohort) and McGill University Health Center, Montreal (Validation cohort).

Patients included were ≥ 18 years of age and had undergone a liver biopsy to assess for NASH between January 1, 2010 and July 1, 2018. Patients were excluded if they had an alternate cause of chronic liver disease or steatosis at biopsy, including viral hepatitis, significant alcohol use, Wilson’s disease, autoimmune hepatitis, α1-antitrypsin deficiency, hemochromatosis, and long-term use of steatogenic medications (including amiodarone, tamoxifen, methotrexate, tetracycline, and glucocorticoids). Further exclusion criteria were presence of malignancy (except HCC) within the past 5 years, immunosuppressive medications within the past 3 years, human immunodeficiency virus (HIV) co-infection, and inadequate liver biopsy (< 10 mm or based on pathology assessment). Significant alcohol use was defined as > 14 units weekly (or > 2 units daily) for women and > 21 units weekly (or > 3 units daily) for men. Anthropometric data and bloodwork within ± 6 months of liver biopsy were included, and VCTE if a valid liver stiffness measure (LSM) was available within ± 12 months of liver biopsy.

Histologic Analysis and Data Acquisition

Liver biopsies were assessed by tertiary center expert histopathologists at each site. In addition, biopsy report summaries outlining the pathologists’ clinical impression were verified by an independent clinical researcher, and those identifying an alternate cause of fibrosis or steatosis as per study exclusion criteria were omitted. NASH and fibrosis were scored using the NASH Clinical Research Network (NASH-CRN) Scoring System [8]. The presence or absence of “advanced fibrosis,” defined as CRN score 3–4 (bridging fibrosis or cirrhosis) was recorded for each biopsy.

Clinical data were acquired via the local electronic medical record. Baseline data included demographics (age at time of biopsy, gender, comorbidities, medications), anthropometric measurements (height, weight, BMI), biochemical tests (liver enzymes: AST, ALT, alkaline phosphatase (ALP), gamma-glutamyl transferase (GGT); liver function: bilirubin, International Normalized Ratio (INR), albumin, glucose; complete blood count (CBC), electrolytes, creatinine, lipid profile, glycated hemoglobin (HbA1c), ferritin), and VCTE reports. Comorbidities (e.g., hypertension, diabetes, dyslipidemia) were recorded based on physician reporting of these conditions within the medical record. A total of 814 liver biopsies were obtained to assess for NAFLD in the Toronto cohort between January 1, 2010 and July 1, 2018. Following initial review of biopsy reports, 196 biopsies were excluded based on stated inclusion and exclusion criteria. The remaining 620 biopsies underwent clinical review, of which 213 were subsequently excluded. The remaining 407 biopsies comprised the training cohort in this study (Fig. 1).

Fig. 1 Training cohort patient selection and biopsy exclusions. NAFLD nonalcoholic fatty liver disease, CLD chronic liver disease, HIV human immunodeficiency virus, MRN medical record number, A1AT alpha-1 antitrypsin deficiency, AIH autoimmune hepatitis, PBC primary biliary cholangitis, PSC primary sclerosing cholangitis, GVHD graft versus host disease, HBV hepatitis B virus, HCV hepatitis C virus, HLH hemophagocytic lymphohistiocytosis, DILI drug-induced liver injury, TPN total parenteral nutrition

Serum-based NIT of NAFLD fibrosis, including NFS, FIB-4, BARD, APRI, and AST/ALT ratio, were calculated for all included patients using published formulas [27,28,29,30]. Validated thresholds predicting advanced F3–4 fibrosis used for the purposes of this study were NFS ≥ 0.676, FIB-4 > 2.67 [31], APRI > 1.5 [32], and AST/ALT ratio ≥ 0.8. Cutoffs predicting F0–2 or the absence of advanced fibrosis were NFS < − 1.455, FIB-4 < 1.3, BARD < 2, APRI < 0.5, and AST/ALT ratio < 0.8. Valid LSM ≤ 8.4 kPa, using the “M” or “XL” probe, was selected to rule out F3–4 fibrosis, with 8.4 kPa representing the Youden associated LSM cutoff for VCTE in our training cohort.
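The index scores above can be illustrated with a minimal sketch. The text cites the formulas [27,28,29,30] without reproducing them, so the expressions below are the commonly published forms and should be checked against those references before use. Function and parameter names are hypothetical, and the units (platelets in 10^9/L, albumin in g/dL, AST and ALT in IU/L) are assumptions of this sketch.

```python
import math


def fib4(age_years: float, ast: float, alt: float, platelets: float) -> float:
    """FIB-4 = (age x AST) / (platelets x sqrt(ALT)); platelets in 10^9/L."""
    return (age_years * ast) / (platelets * math.sqrt(alt))


def apri(ast: float, ast_upper_limit: float, platelets: float) -> float:
    """APRI = (AST / upper limit of normal AST) / platelets x 100."""
    return (ast / ast_upper_limit) / platelets * 100.0


def nfs(age_years: float, bmi: float, diabetes_or_ifg: bool,
        ast: float, alt: float, platelets: float, albumin_g_dl: float) -> float:
    """NAFLD fibrosis score, commonly published linear form (assumed here)."""
    return (-1.675
            + 0.037 * age_years
            + 0.094 * bmi
            + 1.13 * (1 if diabetes_or_ifg else 0)
            + 0.99 * (ast / alt)
            - 0.013 * platelets
            - 0.66 * albumin_g_dl)


def bard(bmi: float, ast: float, alt: float, diabetes: bool) -> int:
    """BARD: BMI >= 28 (1 point), AST/ALT >= 0.8 (2 points), diabetes (1 point)."""
    points = 1 if bmi >= 28 else 0
    points += 2 if (ast / alt) >= 0.8 else 0
    points += 1 if diabetes else 0
    return points


# Hypothetical patient, for illustration only (not study data).
example_fib4 = fib4(age_years=50, ast=40, alt=25, platelets=200)
example_bard = bard(bmi=30, ast=40, alt=25, diabetes=True)
```

Each score would then be compared with the rule-in and rule-out cutoffs listed above; values falling between the two cutoffs are the indeterminate results.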

Sequential Algorithms for Prediction of NAFLD Fibrosis

Three algorithms were constructed in the training cohort using a stepwise combination of NIT: (1) FIB-4 → NFS, (2) FIB-4 → VCTE, (3) FIB-4 → NFS → VCTE (Fig. 2). Standardized cutoffs were used as described above. The second test was applied to patients with “intermediate” range index scores (FIB-4 = 1.3–2.67 for algorithms 1 and 2, and NFS = − 1.455–0.675 for algorithm 3). Patient scores falling above/below cutoffs were labeled as having “high” or “low” likelihood of F3–4 fibrosis. Misclassification rates were reported at each stage of the decision tree. For an algorithm to avoid the need for biopsy, the following criteria were required: (1) sensitivity ≥ 80%, (2) specificity ≥ 80%, and (3) false negative rate < 20%.
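The stepwise logic can be sketched as follows. This is an illustrative reading of the decision tree, not the study's analysis code. It assumes that an LSM above 8.4 kPa is treated as “high” (the text defines only the rule-out side of the VCTE cutoff) and that a patient with a missing follow-up test stays indeterminate. All names are hypothetical.

```python
from typing import Optional, Sequence

LOW, INDETERMINATE, HIGH = "low", "indeterminate", "high"


def classify_fib4(score: float) -> str:
    """FIB-4: < 1.3 rules out, > 2.67 rules in, 1.3-2.67 is indeterminate."""
    if score < 1.3:
        return LOW
    if score > 2.67:
        return HIGH
    return INDETERMINATE


def classify_nfs(score: float) -> str:
    """NFS: < -1.455 rules out, >= 0.676 rules in, otherwise indeterminate."""
    if score < -1.455:
        return LOW
    if score >= 0.676:
        return HIGH
    return INDETERMINATE


def classify_vcte(lsm_kpa: float) -> str:
    """Single cutoff: <= 8.4 kPa rules out; above is treated as high (assumption)."""
    return LOW if lsm_kpa <= 8.4 else HIGH


def sequential(fib4_score: float,
               nfs_score: Optional[float] = None,
               lsm_kpa: Optional[float] = None,
               steps: Sequence[str] = ("nfs",)) -> str:
    """Apply FIB-4 first, then each later test only while the result is indeterminate."""
    result = classify_fib4(fib4_score)
    for step in steps:
        if result != INDETERMINATE:
            break  # an earlier test was decisive; later tests are not consulted
        if step == "nfs" and nfs_score is not None:
            result = classify_nfs(nfs_score)
        elif step == "vcte" and lsm_kpa is not None:
            result = classify_vcte(lsm_kpa)
    return result


# Algorithm 1: FIB-4 -> NFS; Algorithm 2: FIB-4 -> VCTE; Algorithm 3: FIB-4 -> NFS -> VCTE
algorithm_1 = sequential(2.0, nfs_score=1.0, steps=("nfs",))
algorithm_2 = sequential(2.0, lsm_kpa=12.0, steps=("vcte",))
algorithm_3 = sequential(2.0, nfs_score=0.0, lsm_kpa=7.0, steps=("nfs", "vcte"))
```

Because the VCTE step uses a single cutoff in this sketch, any algorithm ending in VCTE leaves no indeterminate results when an LSM is available.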

Fig. 2 Sequential algorithms for advanced fibrosis

Statistical Analysis

Student’s t test was used to compare quantitative data. Chi-squared test was used for comparison of frequency data. The Mann–Whitney U test was used to compare ordinal data. All tests were two-tailed, with p < 0.05 set as the level of statistical significance. All tests assumed equal variance unless standard deviations between compared groups differed substantially. Quantitative variables are expressed as mean ± SD with 95% confidence interval, assuming a normal distribution, based on the central limit theorem. Area under the receiver operating curve (AUROC), as described by DeLong et al. [33], was used to determine accuracy of noninvasive tests and sequential algorithms. To better account for the non-binary nature of biopsy as a reference standard, and spectrum effect across an ordinal scale, the weighted Obuchowski measure was also calculated for each NIT algorithm [34]. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and likelihood ratios were determined for all sequential algorithms, following removal of indeterminate range index scores. For individual NIT, performance characteristics are reported at the maximal Youden Index. For sequential algorithms, given the dichotomous nature of these results, predictive values were calculated using standardized statistical formulae via production of 2 by 2 tables.
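The 2 by 2 table calculations mentioned above follow standard definitions, sketched here for illustration. The counts in the example are hypothetical and are not study data, and the function name is an assumption of this sketch. The function assumes every row and column of the table is non-empty.

```python
import math


def two_by_two_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic accuracy measures from a 2 by 2 table.

    tp/fn: patients with F3-4 on biopsy classified high/low by the test.
    fp/tn: patients with F0-2 on biopsy classified high/low by the test.
    Indeterminate results are assumed to have been removed beforehand.
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    false_positive_rate = fp / (tn + fp)
    false_negative_rate = fn / (tp + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # Likelihood ratios are unbounded when the denominator is zero.
        "lr_positive": (sensitivity / false_positive_rate
                        if false_positive_rate > 0 else math.inf),
        "lr_negative": (false_negative_rate / specificity
                        if specificity > 0 else math.inf),
        "youden_index": sensitivity + specificity - 1,
        "misclassification_rate": (fp + fn) / total,
    }


# Hypothetical counts for illustration only.
example = two_by_two_metrics(tp=60, fp=5, fn=40, tn=95)
```

For a continuous score, the Youden index would be evaluated at each candidate cutoff and the cutoff with the largest value retained.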

Statistical analyses were performed using Medcalc (MedCalc Software Version 19.0.7, Ostend, Belgium). “R” statistical software (R Development Core Team, 2008) was used for “non-binary ROC analysis,” to calculate the Obuchowski measure, using the “nonbinROC” R package created by Paul Nguyen (DOI: 10.1.1.215.7235).

Results

Training Cohort

Demographics

Baseline clinical characteristics for the 407 patients from the training cohort are shown in Table 1. The prevalence of advanced fibrosis was 48% (n = 196/407), and compared to F0–F2, patients with F3–4 were more likely to be older (53.9 vs. 43.4 years), female (57% vs. 36%), with higher BMI (33.4 vs. 31.0), and receiving treatment for coexisting metabolic disease (hypertension, type 2 diabetes, dyslipidemia). There were no differences between the F0–2 and F3–4 cohorts for reported rates of smoking (30.3% vs. 25.5%, p = 0.28).

Table 1 Baseline clinical characteristics in training cohort

Index scores for simple markers NFS, FIB-4, APRI, BARD, AST/ALT were all significantly higher in F3–4 patients. However, mean NFS for F3–4 was lower than the threshold of 0.676. As expected, patients with F3–4 had significantly higher LSM compared to stage F0–2 patients (20.6 ± 16.1 kPa vs. 6.7 ± 2.3 kPa, p < 0.0001).

Single Noninvasive Test Performance for F3–4

The AUROC values for predicting F3–4 using NIT in the training cohort ranged from 0.70 to 0.92 (Obuchowski 0.66–0.86) (Table 2). VCTE was only available in a subset of patients (n = 80). An LSM threshold of 8.4 kPa, the Youden cutoff for our training cohort, had the highest AUROC of 0.92 with sensitivity 0.86 and specificity 0.87. AUROCs for NFS, FIB-4, and APRI were determined following exclusion of 25–39% of patients with indeterminate results (Table 2). Of the simple biomarker-based tests, FIB-4 had high specificity of 0.94 but lower sensitivity of 0.66 and AUROC 0.83, as compared to AUROC 0.78 and 0.80 for NFS and APRI, respectively. There were no significant differences between AUROCs for FIB-4, NFS, APRI, or VCTE. Misclassification rates for simple biomarkers ranged from 17 to 35% (FIB-4 17%, NFS 23%, APRI 23%, BARD 35%, AST/ALT 35%). VCTE had a lower misclassification rate of 14% (Table 2).

Table 2 Noninvasive tests for the prediction of stage F3–F4 fibrosis in training cohort

Sequential Noninvasive Algorithms for Advanced Fibrosis

We next evaluated three sequential algorithms of NIT for the prediction of advanced F3–4 NAFLD fibrosis, with the aim of reducing the proportion of indeterminate and misclassified patients. FIB-4, NFS, and VCTE were selected for their favorable performance as single NITs in the training cohort and availability in clinical practice. Sequential algorithms were constructed as (1) FIB-4 → NFS, (2) FIB-4 → VCTE, (3) FIB-4 → NFS → VCTE (Fig. 2).

Algorithm 1 (FIB-4 → NFS) in patients with FIB-4 score 1.3–2.67 resulted in AUROC of 0.77, with high specificity of 0.95 and lower sensitivity of 0.60. Compared to single NITs, this approach reduced indeterminates to 26/318 (8%) patients, with an overall misclassification rate of 20%.

Algorithm 2 (FIB-4 → VCTE) resulted in an AUROC for F3–4 of 0.81, with a sensitivity of 0.68 and a specificity of 0.95. There were no indeterminate scores, and the misclassification rate was 17%.

In Algorithm 3 (FIB-4 → NFS → VCTE), patients who were indeterminate by FIB-4 underwent NFS for further stratification, followed by VCTE for patients still classified as indeterminate by NFS. This resulted in an AUROC for F3–4 of 0.78, with a sensitivity of 0.62 and a specificity of 0.94. There were no patients left with indeterminate scores, and the misclassification rate was 20% (Table 3).

Table 3 Sequential algorithms for the prediction of F3–4 fibrosis in training cohort

Parallel Combination of Noninvasive Tests for Advanced Fibrosis

We also compared performance characteristics of FIB-4 in parallel to NFS (FIB-4 + NFS) for advanced fibrosis. Patients with both FIB-4 and NFS meeting thresholds for presence or absence of advanced F3–4 fibrosis were classified as “likely” or “unlikely” to have advanced fibrosis. If either FIB-4 or NFS was not in agreement, patients were labeled as indeterminate. Overall concordance of FIB-4 and NFS was 73%. Among patients with a FIB-4 score indicating F0–2, 85% (n = 85/99) also had a concordant NFS score. Among patients with an indeterminate FIB-4 score, 49% (26/53) had a concordant NFS score. For patients with a FIB-4 score indicating advanced fibrosis, 76% (35/46) had a concordant NFS score. Overall, performance characteristics of this parallel FIB-4 + NFS (n = 198) indicated AUROC of 0.81, with sensitivity = 0.67, specificity = 0.96, PPV = 0.91, NPV = 0.81. Use of a parallel analysis thus resulted in comparable diagnostic performance to algorithm 1, with a marginally higher sensitivity, but was limited by a higher indeterminate rate (78/198, 38%). Parallel FIB-4 + NFS did lower overall misclassification rates, with a rate of 16% (19/120 patients) versus 20% for the sequential FIB-4 → NFS algorithm. Overall, 54% were either misclassified or indeterminate for the parallel combination, compared to 28% for this sequential algorithm.
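The parallel rule described above differs from the sequential approach in that both tests must agree before a patient is classified. A minimal sketch, reusing the FIB-4 and NFS cutoffs stated in the Methods, with hypothetical function names and labels:

```python
LOW, INDETERMINATE, HIGH = "low", "indeterminate", "high"


def band_fib4(score: float) -> str:
    """FIB-4: < 1.3 low, > 2.67 high, otherwise indeterminate."""
    if score < 1.3:
        return LOW
    if score > 2.67:
        return HIGH
    return INDETERMINATE


def band_nfs(score: float) -> str:
    """NFS: < -1.455 low, >= 0.676 high, otherwise indeterminate."""
    if score < -1.455:
        return LOW
    if score >= 0.676:
        return HIGH
    return INDETERMINATE


def parallel_fib4_nfs(fib4_score: float, nfs_score: float) -> str:
    """Classify only when both tests agree on presence or absence of F3-4."""
    fib4_band = band_fib4(fib4_score)
    nfs_band = band_nfs(nfs_score)
    if fib4_band == HIGH and nfs_band == HIGH:
        return "likely"
    if fib4_band == LOW and nfs_band == LOW:
        return "unlikely"
    # Discordant results, or either test in its indeterminate range.
    return "indeterminate"


# Hypothetical scores for illustration only.
both_high = parallel_fib4_nfs(3.0, 1.0)
discordant = parallel_fib4_nfs(3.0, -2.0)
```

Requiring agreement is what trades a lower misclassification rate for a higher indeterminate rate: any discordant pair, including one decisive and one indeterminate result, goes unclassified.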

Validation Cohort

Demographics

Baseline clinical characteristics for the n = 134 patients from the external validation cohort as compared to the original cohort are shown in Table 4. The prevalence of advanced fibrosis in the validation cohort was 43% (57/134) and comparable to 48% for the training cohort (p = 0.31). Patients in the validation cohort were older (52.5 vs. 48.5 years; p = 0.0016), with lower mean AST (49.6 vs. 80.2 IU/L; p = 0.0036) and ALT (68.4 vs. 98.4 IU/L; p < 0.0001), but otherwise comparable for biological sex, BMI, and proportion with diabetes mellitus. Despite a similar prevalence of F3–4, mean index scores for FIB-4, NFS, and APRI were lower in the validation cohort compared to the training cohort, but there were no differences in LSM (Table 4).

Table 4 Baseline clinical characteristics for training and validation cohorts

Single Noninvasive Test Performance

Results of blood-based NIT were comparable between training and validation cohorts, with AUROC 0.68–0.81 for F3–4 (Supplementary Table 1). VCTE did not perform as well, with a lower AUROC = 0.66. NIT performed with high specificity (> 0.8) for all scores except FIB-4, which performed with higher sensitivity (0.88). Rates of indeterminate scores for NFS, FIB-4, and APRI were 40%, 34%, and 45%, respectively. These were higher than in the training cohort. Misclassification rates ranged from 23% (AST/ALT) to 44% (VCTE). Obuchowski measure ranged from 0.63 (VCTE) to 0.74 (FIB-4).

Sequential Noninvasive Tests for Advanced Fibrosis in Validation Cohort

Sequential algorithm AUROCs were 0.67–0.70 (Obuchowski 0.65) in the validation cohort. Overall performance characteristics were lower compared to the training cohort (Table 5). Algorithm 1 (FIB-4 → NFS) performed with highest specificity for identifying F3–4 fibrosis.

Table 5 Sequential algorithms for the prediction of F3–4 fibrosis, external validation cohort

Algorithm 1 (FIB-4 → NFS) resulted in modest AUROC of 0.67, with specificity of 0.82 but low sensitivity of 0.56. Compared to single NITs, this approach reduced indeterminates to 15% with an overall misclassification rate of 30%.

Algorithm 2 (FIB-4 → VCTE) resulted in AUROC for F3–4 of 0.69. There were no indeterminate scores, but as expected from the sensitivity and specificity of 0.67–0.71, the misclassification rate was 31%.

Algorithm 3 (FIB-4 → NFS → VCTE) resulted in an AUROC for F3–4 of 0.70, with a sensitivity of 0.62 and a specificity of 0.78. There were no indeterminate scores, and the misclassification rate was 29% (Table 5).

Modeling for Biopsy Avoidance in a Combined Cohort

Based on our data for the combined cohort (Supplementary Tables 2 and 3), we next determined whether a single test or algorithm could reduce the need for liver biopsy diagnosis of advanced fibrosis in the following scenario: (1) negative score (below cutoff) for a test with sensitivity > 80% and (2) positive score (above cutoff) for a test with specificity > 80%, with a type II error rate of < 20%. At a prevalence of 47% (253/541) for advanced fibrosis, liver biopsy could have been avoided in 27% (112/416), 29% (112/387), and 29% (116/395) of patients using sequential algorithms 1, 2, and 3, respectively. This compares to 22% (61/278), 21% (98/469), and 35% (59/168) for NFS, FIB-4, and VCTE as single tests, respectively. Differences in biopsies saved were significant for Algorithm 1 versus FIB-4 (p = 0.0363), Algorithm 2 versus NFS (p = 0.0427), Algorithm 2 versus FIB-4 (p = 0.0069), Algorithm 3 versus FIB-4 (p = 0.0066), and Algorithm 3 versus NFS (p = 0.0419).

Discussion

The identification of advanced fibrosis is important for risk stratification and clinical management in NAFLD. Simple marker algorithms, such as FIB-4 and NFS, are readily available for the noninvasive prediction of advanced fibrosis but are limited by indeterminate scores in a significant proportion of patients. Our study demonstrates the utility of combining simple NITs for the prediction of advanced fibrosis in NAFLD patients and further validates performance characteristics of reducing indeterminates using sequential algorithms in an independent cohort. Our data indicate that the use of sequential NITs can significantly reduce the number of “indeterminate” results compared to single tests, without compromising diagnostic performance, or increasing misclassification rates. Furthermore, by using sequential tests, liver biopsy for the diagnosis of advanced fibrosis could have been avoided in up to 29% of NAFLD patients in our cohort and 27% using sequential combinations of serum-based tests alone.

Individual NIT identified advanced fibrosis in our cohorts with good performance, and overall diagnostic accuracy and indeterminate rates were comparable to prior studies [17, 24, 25, 27, 31]. Both parallel and sequential simple blood marker NITs were evaluated in the training cohort. Compared to their sequential application (in algorithm 1), FIB-4 + NFS in parallel marginally reduced misclassification rates but led to an indeterminate rate of 38%, which was higher than observed with individual tests. A prior multicenter study in 761 NAFLD patients indicated that paired VCTE with NFS or FIB-4 also reduced misclassification rates, but resulted in indeterminate results in over one-half of patients and thus reduced the overall accuracy for paired tests to diagnose F3–F4 to ~ 40% [24]. Data from a phase III clinical trial in NASH stage 3–4 patients indicated that the simultaneous combination of FIB-4 or NFS with ELF or VCTE resulted in low misclassification rates of 4–8%, but unacceptable indeterminate classification in 64–77% of patients [36]. Based on findings from these larger cohort studies, we did not further evaluate paired VCTE + FIB-4 or NFS in our cohort. We also observed discordance rates of ~ 30% between FIB-4 and NFS, which further limited this diagnostic approach in our cohort. Overall, these findings suggest that parallel combination of current NITs has limited utility for diagnosis of advanced fibrosis in NAFLD.

The strengths of our study include the combined cohort of > 500 patients with biopsy-proven NAFLD confirmed by expert histopathologists, and that it is the first to use sequential NIT diagnostic algorithms with an independent validation cohort in a North American NAFLD population. We selected NFS and FIB-4 as first-line tests for ease of use, cost-effectiveness, and access. Importantly, in our overall cohort, sequential FIB-4 → NFS allowed for reduction of indeterminates to 10% compared to 29% for individual tests, while maintaining specificity > 0.9. The overall misclassification rate for FIB-4 → NFS was 22%, with false negatives accounting for > 75% of these misclassified patients, highlighting the lower sensitivity of these simple serum marker tests in our cohorts. Our second sequential algorithm (FIB-4 → VCTE) had an overall specificity of 0.88 for advanced fibrosis and reduced indeterminates to 0%, with a misclassification rate of 20%, of which two-thirds of cases were false negatives (Supplementary Table 3). This compares favorably with recent studies that indicated indeterminate rates of 6–20% and misclassification rates of 16–20% using this sequential algorithm [24, 36]. These differences in indeterminate rates between our data and these recent studies relate to the use of a single lower LSM threshold of 8.4 kPa to “rule out” advanced fibrosis in our cohort, compared to either < 7.9 kPa and ≥ 9.6 kPa [24] or 9.9 kPa and 11.4 kPa [36] as lower and upper thresholds for “ruling out” or “ruling in” advanced fibrosis, respectively. Higher prevalence of F3–4 may also result in higher misclassification and indeterminate results with a sequential approach [37]. Algorithm 3 did not provide any incremental diagnostic accuracy or change in misclassifications compared to FIB-4 → VCTE.

In the multicenter study of 761 NAFLD patients, the prevalence of advanced fibrosis was much lower at 30%, and in keeping with our findings, the best-performing single NITs for advanced fibrosis were FIB-4, NFS, and VCTE. Sequential algorithms incorporating VCTE and FIB-4 or NFS in various combinations improved diagnostic accuracy of identifying advanced fibrosis to ~ 70% [24]. However, performance of a sequential FIB-4 → NFS algorithm was not reported, which we feel is a unique strength of our study, owing to the simplicity and ease of access in performing these tests, and potentially overcoming some of the variability associated with obtaining reliable LSM assessments in non-tertiary clinical settings [23].

We did not explore performance of sequential tests based on new optimal thresholds derived from our combined dataset, as larger cohort studies have been unable to define single thresholds to optimize sensitivity and specificity for simple markers such as FIB-4 and NFS and thus remove these indeterminate scores [36]. Options to further reduce indeterminates and maintain accuracy with serum markers include the use of proprietary scores such as FibroTest or FibroMeter. However, these are associated with additional costs and were not included in this study. Other proprietary tests with validation for sequential use, including ELF™ and FibroMeter [25], are not yet routinely available or covered by provincial health plans in Canada, and VCTE is available for a non-reimbursed fee outside of tertiary centers.

Limitations of our study include that this was a retrospective study at tertiary centers with for-cause biopsy and a higher prevalence of F3–4 fibrosis compared to other studies [27, 31, 38, 39]. As such, further validation of these sequential algorithms is required in lower-prevalence populations. To account for the non-binary nature of our outcome measure and bias toward greater prevalence of F3–4, we evaluated the Obuchowski measure to account for differences in distribution of fibrosis scores in developing our sequential algorithms. However, only a few studies in NAFLD have accounted for the spectrum effect of variable fibrosis distribution in their cohorts during assessment of NIT diagnostic accuracy [25, 40, 41].

Due to increasing comorbidities, advanced age has been shown to reduce the specificity of FIB-4 [42]. Indeed, NIT were principally validated in adult NAFLD patients age < 65 years [16, 17, 31, 42]. We did not further control for this factor, as ~ 10% of patients were aged over 65 years in our combined cohorts, which reflects the general NAFLD population with respect to age. Ethnicity has also been shown to limit the performance of NIT. A recent study demonstrated the poor sensitivity of several NIT in South Asian patients [43]. This was also not controlled for in our study as ethnicity was inconsistently reported.

Highlighting the need for external validation in biomarker studies, differences in algorithm performance were noted between our training and validation cohorts despite having comparable rates of advanced fibrosis. At the selected thresholds, FIB-4 had a higher sensitivity and lower specificity in the external validation cohort. The higher sensitivity for FIB-4 in the validation cohort reflects the lower transaminase levels in this group. Despite lower AUROCs for sequential tests in the external cohort, there was an important reduction in indeterminate rates for the FIB-4 → NFS algorithm, by 17–19% and 21–25% for the training and validation cohorts, respectively.

VCTE was only available in a subset of our patients in this retrospective study. A cutoff of ≤ 8.4 kPa was selected based on the corresponding Youden index determined from our training cohort. This single low LSM cutoff served to optimize sensitivity and exclude patients without advanced fibrosis. Both M and XL probes were used for VCTE in this study. Prior studies have shown variable LSM diagnostic thresholds for NAFLD F3–4 of 6.95–11.4 kPa using the M probe and 5.7–9.3 kPa using the XL probe [44]. Our AUROC of 0.76 for VCTE in the combined cohort was lower than expected. VCTE performed with high sensitivity and specificity in our training cohort. Our validation cohort was limited by a higher false positive rate and lower specificity than expected at the selected 8.4 kPa cutoff. This can be explained by the higher median LSM among F0–2 patients in the validation vs. training cohorts (10.8 kPa vs. 6.3 kPa, respectively). Further, differences may have occurred due to variability in operator technique or patient body habitus. Although both sites had very experienced operators and used a similar protocol (attempting measurement with the M probe first and switching to the XL probe if no reliable measurement could be obtained), further analysis of VCTE-specific parameters and reliability criteria would have helped to elucidate the observed differences between cohorts.

In this study, VCTE alone reduced need for liver biopsy at rates greater than sequential combinations of NIT using VCTE. This may relate to higher sensitivity for F3–4 for VCTE, but despite the differences in VCTE performance between our two cohorts, the use of VCTE as a sequential test improved specificity. In our region, VCTE is principally available at specialist centers and usually associated with non-reimbursed costs. Use of sequential FIB-4 → NFS is a simple means of reducing need for liver biopsy, without compromising accuracy. Although our study did not assess cost-effectiveness, previous work has demonstrated that a sequential combination of serum-based NIT with VCTE is more cost-effective than VCTE alone [45].

In future, given the low prevalence of advanced NAFLD in community practices and heterogeneity in disease-related outcomes [46, 47], a “triage” strategy based on combination NIT-determined advanced fibrosis could reduce the number of referrals for secondary or invasive testing and reduce associated patient anxiety, discomfort, and healthcare costs [22, 48]. Although validation in a primary care setting could allow for implementation of sequential algorithms of NIT for diagnostic purposes, this may be difficult due to the need for liver biopsy as a comparative reference standard. Future work may instead focus on modeling strategies for screening using validated noninvasive tools such as MR elastography, allowing for assessment of NIT algorithm performance at a lower population prevalence. Additional assessment of the utility of such sequential algorithms with age-specific thresholds, validation in multi-ethnic cohorts, incorporation of VCTE-controlled attenuation parameter in diagnostic algorithms to reduce false positives [49], and risk stratification for disease progression or clinical outcomes is still required.

In order for a noninvasive testing strategy to be implemented in primary care, multiple variables must be considered and optimized, which are outside the scope of this study. These include optimization of test accuracy, sensitivity, and specificity, and minimizing misclassified (false positive and negative) and indeterminate results. Optimal thresholds for combined tests have yet to be determined, and minimizing false negative results is important for screening populations at risk of advanced fibrosis. However, validation of these diagnostic approaches in lower prevalence cohorts is still required. Cost-effectiveness of a sequential algorithm strategy for diagnosis and management of advanced fibrosis must also be taken into account.

In summary, the use of sequential NITs was able to accurately predict F3–4 fibrosis and reduce indeterminates, thereby reducing need for liver biopsy, compared to serum NITs used alone or in parallel. A combination of FIB-4 followed by NFS performed well and was comparable to sequential tests incorporating VCTE. The ability of simple NIT sequential algorithms to accurately predict advanced fibrosis with high specificity improves the possibility of identifying patients without biopsy who may benefit from closer surveillance as well as potential candidates for future therapies.