Introduction

Hepatocellular carcinoma (HCC) accounted for more than 800,000 deaths and 20 million disability-adjusted life years globally in 2017 [1]. For advanced HCC not amenable to surgical resection, treatment options are limited. Although tyrosine kinase inhibitors, such as sorafenib and regorafenib, have been shown to prolong survival in patients with advanced HCC [2], not all patients respond to these treatments. Recent advances in immunotherapy have provided new therapeutic options for patients with advanced HCC [3,4,5]. However, immunotherapies remain expensive and are associated with considerable side effects [6]; a clinical tool that identifies the patients most likely to benefit from these new treatments would therefore be highly useful.

Artificial intelligence and machine learning (ML) algorithms are increasingly recognized as highly accurate in the analysis of complex clinical factors and could help guide clinical treatment algorithms. ML models have already been applied to many clinical decision-making processes, such as the prediction of cardiovascular risk [7], sepsis [8], papilloedema [9], cancer risk [10] and colonic lesion histology [11]. Recently, we showed that an ML model could help determine an individual’s risk of gastric cancer development after H. pylori eradication [12]. Machine learning has also been applied, with good accuracy and stability, to the diagnosis and prognostication of liver diseases including non-alcoholic fatty liver disease [13,14,15], viral hepatitis [16,17,18], primary sclerosing cholangitis [19, 20] and hepatocellular carcinoma [20, 21].

The current study aimed to evaluate the role of various available ML models in predicting 1-year cancer-related mortality in patients with advanced HCC receiving immunotherapy, compared with conventional logistic regression as well as two well-known scoring systems for HCC prognosis, namely the C-reactive protein and Alpha Fetoprotein in ImmunoTherapY (CRAFITY) score and the albumin–bilirubin (ALBI) score [22,23,24].

Methods

Data source

All data were retrieved from the Clinical Data Analysis and Reporting System (CDARS), a territory-wide electronic health care database of the Hong Kong Hospital Authority used for clinical research and audit. The Hospital Authority is wholly funded by the Hong Kong Government and is the only public healthcare provider for the local population of 7.5 million. It manages over 85% of all hospital beds and more than 7.5 million outpatient specialist consultations per year. Important clinical data, including patients’ demographics, diagnoses, medication prescription and dispensing records, laboratory results and patient outcomes, are captured by the CDARS. All data in the CDARS are anonymized to protect patient confidentiality. The International Classification of Diseases, Ninth Revision (ICD-9), is used for disease coding. The CDARS database has been used in a number of population-based studies [25,26,27,28,29,30]. The study protocol was approved by the Institutional Review Board of the University of Hong Kong and the West Cluster of the Hong Kong Hospital Authority (reference no: UW 20-778). This study was reported according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement (Supplementary Table 1) as well as the MI-CLAIM criteria, with the checklist of compliance provided in Supplementary Table 2.

Patients

We identified all adult patients (aged 18 or above) with advanced HCC who had received immunotherapy between January 2014 and December 2019. HCC was identified by the ICD-9 diagnosis code 155.0. Three immunotherapies were included in this study: nivolumab, pembrolizumab and ipilimumab. As immunotherapy was only recently approved for use in HCC by the United States Food and Drug Administration [31, 32], it was used in these patients mainly as an off-label, last-line treatment option for advanced disease during the study period.

Patients were randomly divided into a training set and an internal validation set. There was a total of 48 clinical variables, including baseline characteristics, age at receiving immunotherapy, type of immunotherapy, history of prior treatment (e.g., hepatectomy, transcatheter arterial chemoembolization, and local ablative therapies including radiofrequency ablation and ethanol injection), use of tyrosine kinase inhibitors ([TKIs] sorafenib or lenvatinib), underlying liver diseases (chronic hepatitis B [HBV], chronic hepatitis C [HCV], alcoholic liver disease), comorbidities (diabetes mellitus [DM], hypertension [HT], ischemic heart disease [IHD], atrial fibrillation [AF], congestive heart failure [CHF], stroke and infection) and concomitant medication use. Infection before or after the commencement of the first immune checkpoint inhibitor (ICI) was defined by the following ICD-9 infectious disease codes: 001–139, 320–321, 460–466, 470–478, 480–488, 540–543, 566, 567, 575.0, 576.1, 590, 595 and 680–686. Laboratory parameters included alpha fetoprotein (AFP), bilirubin, alkaline phosphatase (ALP), alanine aminotransferase (ALT), aspartate aminotransferase (AST), albumin, globulin, international normalized ratio (INR), platelet count, serum creatinine, sodium (Na) and C-reactive protein (CRP). The CRAFITY score and the ALBI score were determined and compared with the performance of the various ML models. The severity of liver disease was determined by the Child–Pugh Score (CPS).

The primary endpoint was cancer-related mortality (with liver cancer specified as the cause of death) within 1 year after immunotherapy. A secondary endpoint was all-cause mortality, which was used to assess the effect of including non-cancer-related deaths on the outcome of the ML models.

Concurrent use of medications (including aspirin, statins, H2 receptor antagonists, proton pump inhibitors [PPIs] and antibiotics) was defined as any use within 30 days before or after the commencement of the first ICI. A total of eleven classes of antibiotics were included, namely penicillins, cephalosporins, macrolides, carbapenems, quinolones, tetracyclines, aminoglycosides, nitroimidazoles, glycopeptides, sulpha/trimethoprim, and others (nitrofurantoin, rifampicin, and rifaximin). None of these patients received monobactams, clindamycin, linezolid, or daptomycin.

Construction of training models using different machine learning algorithms

For all subjects in the database, a case was defined as cancer-related death within 1 year, and a control as no cancer-related death within 1 year. Six common supervised ML algorithms were included in the analysis: logistic regression (LR), the least absolute shrinkage and selection operator (LASSO) [33, 34], extreme gradient boosting (XGBoost) [35], random forest (RF) [36], stochastic gradient boosting (GBM) [37] and a sparse neural network (NN) [38]. The LR applies a non-linear logit transformation to the linear predictor so that the predicted probability is bounded between 0 and 1. The LASSO is a linear regression that performs regularization, shrinkage and variable selection by penalizing the sum of the absolute values of the weights; individual weights can be shrunk to exactly zero [33, 34]. The RF classifier consists of a large number of weakly correlated decision trees, with the final prediction taken by majority vote; the error is reduced because a large ensemble of trees averages out the mistakes of any individual tree [36]. XGBoost is a boosting algorithm that controls overfitting through a regularized learning objective, gradient tree boosting, shrinkage and column subsampling [35]. The GBM constructs additive regression models by sequentially fitting a simple parameterized function to the current “pseudo”-residuals by least squares at each training step [37]. The sparse NN is a neural network that minimizes the squared error subject to a penalty on the L1-norm of the weights [38] (Supplementary Table 3).
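Two of the properties described above can be illustrated with a minimal, self-contained Python sketch (the study's analyses were performed in R; this code and its function names are purely illustrative): the logit link that bounds LR predictions between 0 and 1, and the soft-thresholding step through which the LASSO's L1 penalty shrinks small weights exactly to zero.

```python
import math

def logistic(z):
    # The logit link maps any real-valued linear predictor z to a
    # probability strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(w, lam):
    # The LASSO's L1 penalty shrinks each weight toward zero by lam and
    # sets it to exactly zero when |w| <= lam, which is how the LASSO
    # performs variable selection.
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0
```

Note that soft-thresholding, unlike the quadratic (ridge) penalty, produces exact zeros, so predictors with weak signal drop out of the model entirely.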

Traditional logistic regression was used as the reference in this study. A missing-value plot was used to visualize any missing data, and all data were manually checked to ensure that there were no missing values in the database. Feature variables were chosen by forward and backward stepwise selection in the training set for the logistic regression. Pair-wise correlations were plotted in a correlation matrix to ensure minimal collinearity among the chosen feature variables. The selected variables and outcome were fitted into a logistic regression model to train the prediction model of 1-year cancer-related mortality. Discriminating power was assessed on an independent validation set via the area under the receiver operating characteristic curve (AUC). Variable importance analysis was performed using a “filter” approach: a loess smoother was fitted between the outcome and each variable, the R2 statistic was calculated for each model against the intercept-only null model, and the relative percentage of the R2 statistic for each variable was returned as a relative measure of its importance in a given model. Shapley values were calculated to assess the effect of individual significant features on the machine learning models.
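The AUC used here is equivalent to the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. A brute-force, stdlib-only Python sketch of this rank interpretation (illustrative only; the study computed its AUCs in R):

```python
def auc(labels, scores):
    # Area under the ROC curve via the rank (Mann-Whitney) interpretation:
    # the probability that a random case (label 1) outscores a random
    # control (label 0), with ties counting half.
    wins, pairs = 0.0, 0
    for l_case, s_case in zip(labels, scores):
        if l_case != 1:
            continue
        for l_ctrl, s_ctrl in zip(labels, scores):
            if l_ctrl != 0:
                continue
            pairs += 1
            if s_case > s_ctrl:
                wins += 1.0
            elif s_case == s_ctrl:
                wins += 0.5
    return wins / pairs
```

This quadratic-time version is only for intuition; production implementations sort the scores once and compute the same quantity from ranks.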

Ten-fold cross-validation was used to choose the best prediction model for the algorithms that required tuning (NN, RF, GBM and XGBoost). The training data were randomly divided into 10 subsets: nine were used for training the machine learning model and the remaining one for internal validation. This procedure was repeated 10 times with different validation subsets. The best model of each algorithm, with the best tuned parameters, was subsequently evaluated on the independent validation set.
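The fold-splitting step described above can be sketched in a few lines of stdlib Python (illustrative only; the actual tuning in the study was done in R):

```python
import random

def k_fold_indices(n, k=10, seed=42):
    # Shuffle the n training indices and deal them into k near-equal folds.
    # In each of the k rounds, one fold is held out for internal validation
    # while the other k - 1 folds are used for training.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

Each candidate hyper-parameter setting is scored by averaging the validation metric over the 10 held-out folds, and the best setting is then refitted on the full training set before being evaluated on the independent validation set.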

Comparison of different machine learning models and clinical scores

The AUCs of the different models and clinical scores were compared using DeLong's test, and the best cutoff point for each model was estimated by Youden's method to calculate the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR) and negative likelihood ratio (NLR) for each model. Multiple comparisons were adjusted by Bonferroni correction.
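Youden's method selects the cutoff that maximizes J = sensitivity + specificity − 1. A stdlib-only Python sketch of this scan (illustrative only; the study's analyses were done in R):

```python
def youden_cutoff(labels, scores):
    # Return the score threshold maximizing Youden's J = sens + spec - 1,
    # scanning every observed score as a candidate cutoff (predict
    # "positive" when score >= threshold).
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= t)
        fn = sum(1 for l, s in zip(labels, scores) if l == 1 and s < t)
        tn = sum(1 for l, s in zip(labels, scores) if l == 0 and s < t)
        fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= t)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        if sens + spec - 1 > best_j:
            best_j, best_t = sens + spec - 1, t
    return best_t, best_j
```

At the chosen threshold, PPV, NPV, PLR and NLR follow directly from the same confusion-matrix counts (e.g., PLR = sensitivity / (1 − specificity)).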

Time-to-event XGBoost survival model

The training set was also input into an XGBoost survival model with time-to-event analysis. The concordance index and Brier score were calculated to compare the predicted time to event under the trained XGBoost survival model with the ground truth of the validation set.
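The concordance index generalizes the AUC to time-to-event data: among comparable patient pairs, it is the fraction in which the patient predicted to be at higher risk actually experiences the event first. A stdlib-only sketch of Harrell's C-index with right-censoring (illustrative only; the study fitted its survival model in R):

```python
def concordance_index(times, events, risks):
    # Harrell's C-index. A pair (i, j) is comparable when subject i has an
    # observed event (events[i] == 1) strictly before time j; the pair is
    # concordant when the model assigned i the higher predicted risk.
    # Ties in predicted risk count half. Censored subjects (events == 0)
    # cannot anchor a comparable pair because their true event time is
    # only known to exceed the censoring time.
    num, den = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[i] < times[j]:
                den += 1
                if risks[i] > risks[j]:
                    num += 1.0
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the 0.81 reported below therefore indicates good discrimination of event ordering.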

Application of machine learning models and clinical scores to external validation cohort

To further validate the trained ML models, an external validation cohort of HCC patients treated with immunotherapy between January 2020 and March 2021 was retrieved from the electronic patient record (ePR) system of Queen Mary Hospital. The ePR system is a comprehensive patient data system used for the daily clinical management of patients, which includes patients' demographics, current and past diagnoses, admissions and consultations, procedures, imaging and laboratory test results, current and past medications, and smoking and drinking habits.

The different ML models and clinical scores were applied to the external validation cohort, and their performances were compared in terms of AUC, sensitivity, specificity, PPV, NPV, PLR and NLR.

All statistical analyses were performed with the R statistical software (version 3.2.3; R Foundation for Statistical Computing). The R libraries "caret", "xgboost", "gbm", "glmnet" and "sNNR" were used for machine learning. Continuous variables were expressed as mean and SD. The Mann–Whitney U test (Wilcoxon rank-sum test) was used to compare continuous variables between two groups. The chi-square test or Fisher's exact test, where appropriate, was applied to compare categorical variables.
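As a small worked example of one of these tests, the Pearson chi-square statistic for a 2×2 contingency table has the closed form n(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). A stdlib-only Python sketch (illustrative only; the study ran its tests in R):

```python
def chi_square_2x2(a, b, c, d):
    # Pearson chi-square statistic (without continuity correction) for the
    # 2x2 contingency table [[a, b], [c, d]]. The statistic is compared
    # against the chi-square distribution with 1 degree of freedom to
    # obtain a p-value; Fisher's exact test is preferred when expected
    # cell counts are small.
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

For example, a perfectly balanced table gives a statistic of 0 (no association), while increasing imbalance between the rows inflates the statistic.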

Results

Patients

A total of 395 patients with advanced hepatocellular carcinoma who had received immunotherapy between January 2014 and December 2019 were included; 335 (84.8%) were male, with a mean age of 60.6 years. The most common cause of HCC was HBV infection (75.6%), and the overall 1-year cancer-related mortality was 51.1%. The dataset was randomly divided into two sets, with 316 patients (80%) assigned to the training set and the remaining 79 patients (20%) assigned to the independent validation set. The characteristics of patients in the training and validation sets are summarized in Table 1.

Table 1 Characteristics of the training set, internal validation set and external validation cohort

Performance of different machine learning models

The performances of the different machine learning models are summarized in Fig. 1 and Table 2. Of all the ML models, the RF had the best AUC of 0.92 (95% CI: 0.87–0.98), which was significantly higher than that of the LR (0.82, p = 0.01). Statistical comparisons among the different models are shown in Supplementary Tables 4–10.

Fig. 1
figure 1

The AUC of different machine learning models and clinical scores with the independent validation set

Table 2 Diagnostic performances of different machine learning models on the internal validation set

The LASSO, XGBoost and NN models had the highest sensitivity of 0.91. Their sensitivities were significantly higher than that of the LR (LASSO 0.91 vs LR 0.75, p = 0.05; XGBoost 0.91 vs LR 0.75, p = 0.05; and NN 0.91 vs LR 0.75, p = 0.03; Supplementary Table 5). In contrast, the RF model had the best specificity, PPV and PLR (Supplementary Tables 7 and 9). Likewise, decision curve analysis showed that the net benefits of the RF, NN and GBM surpassed that of the LR throughout the threshold range (Supplementary Fig. 1). In the XGBoost survival model using time-to-event analysis, the concordance index was 0.81 (95% CI 0.81–0.81) with a Brier score of 0.21 (95% CI 0.20–0.21) for the independent validation set.

Comparison of machine learning models with CRAFITY score and ALBI score

All machine learning models performed better than the CRAFITY score in terms of AUC in the validation set (Table 2). Moreover, the specificity, PPV and PLR of all machine learning models were better than the CRAFITY score (Table 2).

For the ALBI score, only RF performed significantly better than the ALBI score in terms of AUC (RF: 0.92 vs 0.84, p = 0.04). However, the XGBoost performed significantly better than ALBI in terms of sensitivity (0.91 vs 0.80, p = 0.05), NPV (0.87 vs 0.72, p = 0.05) and NLR (0.12 vs 0.28, p = 0.05), whereas GBM performed significantly better than ALBI in terms of specificity (0.86 vs 0.71, p = 0.04), PPV (0.88 vs 0.78, p = 0.04) and PLR (5.74 vs 2.72, p = 0.04).

Subgroup analyses

Subgroup analyses were performed to evaluate the performances of the different ML models (Supplementary Table 11). For patients ≥ 65 years, the GBM, RF and XGBoost were significantly better than the CRAFITY score in terms of AUC. For patients < 65 years, the LR, LASSO, RF, GBM and NN were significantly better than the CRAFITY score, and the LASSO and NN were significantly better than the ALBI score.

For patients receiving immunotherapy as first-line treatment, the RF, GBM and NN were better than the CRAFITY score, but only the RF performed better than the ALBI score. For patients with Child's A grading, the LASSO, XGBoost, RF, GBM and NN were better than the CRAFITY score; and the XGBoost, RF and NN were better than the ALBI score. For patients with a viral cause of HCC, the LASSO, XGBoost, RF, GBM and NN were significantly better than the CRAFITY score. For patients with cirrhosis, the RF and GBM performed better than the CRAFITY score. For patients without cirrhosis, the RF, XGBoost, LASSO and GBM performed better than the CRAFITY score.

Risk factors associated with 1-year cancer-related mortality by different models

The LR model identified six factors associated with 1-year cancer-related mortality in HCC patients, including high bilirubin, AFP, ALP and platelet levels, and the presence of cirrhosis. The use of pembrolizumab was associated with a lower mortality risk (Table 3).

Table 3 Comparison of factors selected by forward and backward selection of logistic regression model between high risk (cancer-related death within 1 year) and low risk (survived at 1 year) groups

Figure 2 shows the risk factors used by the different machine learning models to predict 1-year cancer-related mortality. High baseline AFP, bilirubin and ALP were the three common risk factors identified by all ML models. The presence of infection, use of antibiotics, low albumin, high ALT, high GGT, high AST, high PT, high WBC and high platelet count were identified by five ML models (LASSO, XGBoost, RF, GBM and NN). At least three models identified the use of pembrolizumab and statins as protective factors (Fig. 2 and Supplementary Figs. 2–5). Four models identified Child's B or C cirrhosis as a risk factor. Other factors included older age, the presence of cirrhosis and a history of hepatectomy.

Fig. 2
figure 2

Significant factors associated with 1-year cancer-related mortality identified by different ML models

Predictor analysis by the best machine learning model (Random Forest)

By estimation using the Shapley value for each categorical predictor used in the RF model, the adjusted relative risk for the presence of infection was 3.49 (95% CI 2.57–4.47); use of antibiotics, 2.65 (95% CI 2.41–2.91); Child's B or C cirrhosis, 3.17 (95% CI 2.91–3.46); presence of cirrhosis, 3.85 (95% CI 3.47–4.26); prior hepatectomy, 3.88 (95% CI 3.43–4.38); use of pembrolizumab, 0.43 (95% CI 0.40–0.48); and use of statins, 0.53 (95% CI 0.46–0.60) (Supplementary Figs. 6–12).

For continuous predictors, the Shapley values showed a positive adjusted correlation for ALP, ALT, AST, GGT, PT, WBC, platelet count, AFP and bilirubin, and a negative correlation for albumin and sodium levels (Supplementary Figs. 13–23).

False positive and false negative rates

For the whole cohort, the LR model falsely predicted 37.7% (n = 39) of patients to die and falsely predicted 20.3% (n = 76) to survive within 1 year. Of all the models, the RF had the lowest false positive rate (2.0%) and false negative rate (5.2%). All advanced ML models except the GBM performed better than the LR in terms of false positive rate (Table 4). All models were also superior to the LR in terms of false negative rate.

Table 4 False positive and false negative rates of various ML models

All-cause mortality as outcome

Using the same predictors, the accuracy of the ML models did not differ significantly when the outcome was changed to all-cause mortality. The AUCs for all-cause vs cancer-related mortality were: LR, 0.85 (95% CI 0.76–0.94) vs 0.82 (95% CI 0.73–0.91), p = 0.65; LASSO, 0.88 (95% CI 0.81–0.96) vs 0.87 (95% CI 0.78–0.95), p = 0.98; XGBoost, 0.83 (95% CI 0.73–0.94) vs 0.89 (95% CI 0.81–0.97), p = 0.34; RF, 0.90 (95% CI 0.84–0.97) vs 0.92 (95% CI 0.87–0.98), p = 0.65; GBM, 0.87 (95% CI 0.79–0.96) vs 0.89 (95% CI 0.82–0.96), p = 0.77; and NN, 0.86 (95% CI 0.78–0.95) vs 0.89 (95% CI 0.83–0.96), p = 0.58.

Validation by the external validation cohort

A total of 43 patients with HCC (74.4% male, mean age 69.0 years) were retrieved from the ePR system. The most common cause of HCC was HBV infection (76.7%). There were considerable differences in baseline characteristics between this cohort and the training set (Table 1). In particular, the overall 1-year cancer-related mortality was significantly lower in the external validation cohort than in the training set (23.3% vs 51.1%, p < 0.01). Despite these differences, both the RF and LASSO retained excellent performance in terms of AUC (RF: 0.91; LASSO: 0.93) (Fig. 3), which was significantly better than the CRAFITY score (0.91 vs 0.66, p < 0.01; 0.93 vs 0.66, p < 0.01) and the LR (0.91 vs 0.72, p < 0.01; 0.93 vs 0.72, p < 0.01; Supplementary Table 12).

Fig. 3
figure 3

The AUC of different machine learning models and clinical scores with the external validation cohort

Discussion

While immunotherapy appears to be a promising new treatment option for patients with advanced HCC, there is no reliable predictor of treatment outcome, including mortality. This is the first study to evaluate the use of the latest advanced ML models, based on baseline clinical characteristics, to predict treatment outcome. We showed that ML models could accurately predict the 1-year mortality of advanced HCC patients with an absolute error within 5%. In particular, the RF, GBM and NN models had high AUCs for cancer-related mortality risk prediction (0.92, 0.89 and 0.89, respectively). Moreover, these advanced ML models were also significantly better than the traditional LR model in terms of AUC.

Of the various ML models, the RF model had the best performance in the current study. Although the algorithms used by ML are usually considered a "black box", we performed factor analysis to gain more insight into the factors used in predicting 1-year cancer-related mortality (Fig. 2). Our analysis showed that the RF model used 18 independent variables, including baseline ALP, bilirubin, AST, AFP, albumin, GGT, ALT and sodium levels, prothrombin time, white blood cell count, platelet count, Child's score, history of hepatectomy, presence of infection, and uses of antibiotics, statins and pembrolizumab, to predict cancer-related mortality. Identifying these factors could help investigators understand the complex interactions between clinical variables beyond what traditional statistical methods, e.g., the logistic regression model, can reveal. For example, some of these advanced ML models identified the use of statins and antibiotics as significant factors, which were not revealed by traditional linear models. Moreover, unlike conventional LR, which usually assumes linear associations, these ML models can capture associations between clinical variables that interact in a non-linear fashion.

There were three categories of factors associated with the survival of cancer patients. First, tumor factors, such as the size and number of tumor nodules [39], tumor involvement of nearby structures such as the vasculature or adjacent organs [40, 41], expression of immune checkpoint receptors [42], and previous interventions, are known to have a significant impact on survival [43]. Interestingly, we found that prior treatment could have different impacts in different ML models. In our predictor analysis, previous hepatectomy adversely affected the predicted risk of cancer-related death in the LASSO, XGBoost and RF models; previous TACE influenced cancer-related death in the NN model; and previous ablative therapy significantly influenced the GBM model. In the best-performing model (RF), the Shapley value of previous hepatectomy was 3.3 times higher than in those without prior hepatectomy. Second, liver function tests, such as serum bilirubin, prothrombin time, albumin level, AST level, presence of ascites and hepatic encephalopathy, have been demonstrated to influence the prognosis of HCC patients [44, 45]. Third, patients' demographics, e.g., age and sex, and performance status, such as the Eastern Cooperative Oncology Group performance status (ECOG PS), have been shown to be important prognostic factors for HCC [44, 45].

There are existing scoring systems for HCC, such as the Barcelona Clinic Liver Cancer score [46] and the Hong Kong Liver Cancer staging score, which determine the prognosis of HCC patients and guide intervention [47, 48]. The AUCs of these scoring systems ranged from 0.72 to 0.78 in predicting 12-month survival [49]. However, these scoring systems require comprehensive imaging data and even histology information, and they are not limited to advanced HCC. Furthermore, we found that the RF model was superior to the ALBI score in the testing data set and performed better than the CRAFITY score in the external validation cohort.

There were unique advantages of the current study. Our study included a large cohort of advanced HCC patients treated with immunotherapy in Hong Kong, which could provide a novel and timely observation on the 1-year survival risk prediction of these patients. We also demonstrated the potential differences in the performance of various advanced ML models when applied to address a specific clinical condition. Moreover, our factor analysis helped to uncover the “black box” nature of these advanced ML models.

However, there are limitations to this study. First, despite being territory-wide, the study had a limited sample size, as immunotherapy was still a novel treatment option for advanced HCC patients. Moreover, the best immunotherapy regimen for advanced HCC patients had not been properly defined during the study period, and no standardized regimen was used across patients. Despite the accuracy of the ML models in predicting cancer-related mortality after immunotherapy, treatment decisions can be influenced by many other factors, including the patient's willingness to be treated, medication side effects and reimbursement issues. This study can only serve as a pilot demonstrating the feasibility of using advanced ML models to guide this decision process. Second, our cohort included mainly Chinese patients with HBV-related cancer, and our results may not be applicable to other populations with predominantly non-HBV-related HCC. We have, however, performed subgroup analyses, which demonstrated similar performances in patients with viral and non-viral causes of HCC (Supplementary Table 11). The trained ML models were also further tested in an external validation cohort of HCC patients with quite different baseline characteristics and even 1-year mortality, and still yielded favorable results. Further multi-country studies are needed to validate the performance of ML models for this purpose.

Conclusion

We have demonstrated that advanced ML models, especially the RF model, were more accurate than traditional LR and conventional risk scores in predicting the 1-year cancer-related mortality of advanced HCC patients receiving immunotherapy. The use of ML models can potentially help select the patients who would benefit most from this emerging treatment for advanced HCC.