Introduction

Cholangiocarcinoma comprises a diverse group of malignant tumors that arise in the biliary system and can be anatomically classified into intrahepatic, perihilar and distal cholangiocarcinoma [1]. Although less common than hepatocellular carcinoma (HCC), the detected incidence of cholangiocarcinoma, especially intrahepatic cholangiocarcinoma (ICC), has been rising in many countries, and ICC has become the second most common liver malignancy [2]. Unlike extrahepatic cholangiocarcinoma, ICC rarely causes jaundice and is usually asymptomatic at an early stage [3], which makes early diagnosis difficult. Consequently, only about 30–40% of patients can undergo curative surgical resection [4]. Furthermore, 40–80% of patients experience recurrence, with 44% of these detected in the first year after surgery [5]. In addition, ICC is usually more aggressive than HCC and more frequently refractory to chemotherapy, making it a deadly cancer with a 5-year survival rate below 30% after surgery [6]. Therefore, accurate preoperative assessment of the risk of recurrence would help to improve the clinical management and prognosis of patients with ICC.

Currently, the AJCC 8th edition TNM staging system is used for risk stratification of ICC [7]. However, in clinical practice, postoperative recurrence and prognosis differ markedly even among patients classified at the same stage [6, 8]. Existing clinical factors do play a positive role in patient stratification [9]. Nevertheless, owing to the high heterogeneity of ICC, the current clinical staging system still has many limitations, which has prompted clinicians to seek more effective predictive tools.

Dynamic contrast-enhanced CT is routinely used for the detection and evaluation of ICC. Traditionally, however, diagnosis and prognostic assessment depend largely on the physician's personal interpretation of the images, which inevitably introduces subjectivity and makes quantitative comparison difficult [10]. Moreover, given the heavy workload of radiologists, relying solely on the human eye sometimes leads to oversights. In contrast, radiomics is a novel, integrated discipline that combines artificial intelligence with imaging analysis. It has been increasingly used in oncology practice, for example in tumor diagnosis, staging and assessment [11, 12], and is popular for its accessibility and non-invasiveness.

Despite the great potential of radiomics, some investigators have urged the field to broaden its focus to include the reproducibility and generalizability of the models it generates, which is an important consideration for the practical application of radiomics studies [13, 14]. Since ICC is an uncommon tumor, most previous studies were single-center and reported relatively small numbers of cases (for instance, < 200) [15, 16]. Therefore, a multicenter study with a large sample is necessary. Moreover, the different CT scanners and acquisition parameters used by different centers pose a great challenge to developing an accurate and reproducible prediction system. In this study, we collected large cohorts of ICC data from different regional centers in China and sought to develop and validate a preoperative model for predicting the risk of early recurrence (ER) within 1 year after curative resection of ICC using an artificial intelligence (AI)-based CT radiomics approach.

Materials and methods

Patients

The study was reported according to the STROBE guidelines. This multicenter study included ICC patients who underwent radical surgical resection between January 2008 and December 2020 at eight hospitals in China. Two inclusion criteria were applied: (1) contrast-enhanced CT examination was performed within 1 month before surgery; (2) the patient underwent surgery with curative intent and ICC was confirmed by postoperative pathology. Patients with the following characteristics were excluded: (1) lack of preoperative CT data; (2) previous history of other ICC treatments (including interventional therapy and ablation); (3) history of other malignancies; (4) loss to follow-up, such that recurrence status within 1 year after surgery was unknown. A total of 311 cases were included in the study (Fig. 1). Patients were divided into three cohorts, each located within a single province (the hospitals comprising each cohort are listed in the Supplementary Materials). Finally, a derivation cohort was established from Cohort 1 and randomly divided into a training group (n = 140) and an internal validation group (n = 36) at a ratio of 8:2, and Cohort 2 (n = 74) and Cohort 3 (n = 61) served as two independent external validation cohorts.
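The random 8:2 division of the derivation cohort can be illustrated with a minimal sketch. The seed, the function name and the use of integer patient IDs are assumptions made for illustration; the source does not state how the random split was implemented or whether it was stratified by outcome.

```python
import random

def split_derivation_cohort(patient_ids, train_fraction=0.8, seed=0):
    """Randomly split patient IDs into training and internal validation groups.

    The seed is a hypothetical choice that only makes the sketch reproducible.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Round down, so 176 patients at 0.8 give 140 training cases.
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# 176 placeholder IDs stand in for the Cohort 1 patients (140 + 36).
train_ids, valid_ids = split_derivation_cohort(range(176))
print(len(train_ids), len(valid_ids))
```

With 176 patients, rounding down at a fraction of 0.8 reproduces the reported group sizes of 140 and 36.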

Fig. 1

Flowchart of the enrolled study patients

Clinical data and assessments

The following clinical and laboratory data were collected before surgery: basic demographic information, serum levels of alanine transaminase (ALT), aspartate transaminase (AST), albumin, total bilirubin, TNM tumor staging (8th edition), concurrent hepatobiliary system disease (viral hepatitis, cirrhosis, biliary fluke infection, primary sclerosing cholangitis and cholelithiasis), alpha-fetoprotein (AFP), carcinoembryonic antigen (CEA) and carbohydrate antigen 19-9 (CA19-9). The threshold values used for ALT, AST, AFP, CA19-9, and CEA were 35 U/L, 40 U/L, 20 ng/ml, 1000 U/ml, and 5 ng/L, respectively.

Radiographic data and assessments

Preoperative dynamic contrast-enhanced CT images (both unenhanced and arterial phases) were collected. Details of CT acquisition and parameters are described in Supplementary Table S1. Two experienced radiologists (JZ and YKX) independently documented tumor number, size, margins, vascular invasion, and lymph node metastases (see Supplementary Materials for detailed definitions). In case of disagreement, a consensus was reached through discussion with a third radiologist.

Image segmentation and feature extraction

Two radiologists (JZ and YKX, both with more than 15 years of experience in abdominal imaging), who were blinded to clinical outcomes, independently segmented the tumor slice by slice to obtain a 3D volume of interest (VOI) using 3D Slicer (version 4.10.2). Then, a peri-tumoral region was generated in 3D Slicer by uniformly expanding the VOI boundary outward by 5 mm (Supplementary Fig S1). If there were multiple tumors, the largest one was selected for analysis. To ensure the stability of the extracted features, the inter-observer reproducibility of the features extracted by the two readers was evaluated. In addition, images from 50 randomly selected cases in the derivation cohort were re-segmented 2 weeks later by the same two radiologists using the same procedure to assess intra-observer reproducibility. Inter-observer and intra-observer reproducibility were evaluated by calculating the intraclass correlation coefficient.
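As an illustration of the reproducibility check, the sketch below computes an intraclass correlation coefficient for one feature measured by several readers. It assumes the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1); the source does not state which ICC form was used, and the feature values are made up.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per case and one column per reader."""
    n = len(ratings)          # number of cases
    k = len(ratings[0])       # number of readers (or reading sessions)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical values of one radiomic feature from two readers, five cases.
feature_by_reader = [[10.1, 10.4], [12.3, 12.0], [9.8, 9.9], [15.2, 14.7], [11.0, 11.3]]
icc = icc_2_1(feature_by_reader)
print(round(icc, 3), "retained" if icc > 0.75 else "excluded")
```

A feature would be kept only if both its inter-observer and intra-observer coefficients exceed the 0.75 threshold used in the feature selection step.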

All images were resampled to a voxel size of 1 × 1 × 1 mm³ for standardization, so that the radiomics analysis would be comparable and reproducible. Then, CT features were extracted from the normalized images and from images filtered with wavelet and Laplacian of Gaussian (LoG) filters using Pyradiomics (version 3.0). The detailed parameters are listed in the Supplementary Materials. A total of 6296 features covering the tumoral and peri-tumoral areas were finally obtained for each case. The features fell into seven categories: (1) Shape; (2) First-order statistics; (3) Gray Level Cooccurrence Matrix; (4) Gray Level Run Length Matrix; (5) Gray Level Size Zone Matrix; (6) Neighboring Gray Tone Difference Matrix; (7) Gray Level Dependence Matrix [17].

Feature selection

The features were first normalized by the z-score method to eliminate differences in scale. Feature selection in the derivation cohort then consisted of three steps: (1) robust features with intraclass correlation coefficients (both inter-observer and intra-observer) higher than 0.75 were retained; (2) features were ranked by the maximum relevance minimum redundancy (mRMR) algorithm, and the top 100 features, those combining the highest relevance with the minimum redundancy, were retained; (3) recursive feature elimination with cross-validation was performed to determine the best feature subset.
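The first two selection steps can be sketched as follows. This is a simplified illustration, not the authors' pipeline: absolute Pearson correlation is used here as a stand-in for the relevance and redundancy measures, the greedy "relevance minus mean redundancy" score is one common mRMR variant, and all feature names and values are hypothetical. Step 3 (recursive feature elimination with cross-validation) is omitted.

```python
import math

def zscore(values):
    """Standardize a feature to zero mean and unit (sample) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def select_features(features, labels, icc, icc_min=0.75, top_k=100):
    """features: name -> values; icc: name -> lowest inter/intra-observer ICC."""
    # Step 1: keep only reproducible features, then z-score them.
    robust = {name: zscore(vals) for name, vals in features.items() if icc[name] > icc_min}
    relevance = {name: abs(pearson(vals, labels)) for name, vals in robust.items()}

    # Step 2: greedy mRMR-style ranking (assumed correlation-based variant).
    selected, remaining = [], set(robust)
    while remaining and len(selected) < top_k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(abs(pearson(robust[name], robust[s])) for s in selected)
            return relevance[name] - redundancy / len(selected)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: six patients, ER label 0/1, four candidate features.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 5, 7],   # nearly a copy of "a" (redundant)
    "c": [1, 3, 1, 3, 2, 4],   # weaker but less redundant
    "d": [0, 0, 0, 1, 1, 2],   # informative but not reproducible
}
icc = {"a": 0.92, "b": 0.90, "c": 0.81, "d": 0.50}
print(select_features(features, labels, icc, top_k=2))
```

In this toy example the unstable feature is dropped before ranking, and the near-duplicate of the top feature is ranked below a less redundant one, which is the behavior mRMR is designed to produce.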

Clinical features (including laboratory tests and conventional imaging features) were selected through univariable and multivariable logistic regression for subsequent model building.

Model construction and validation

The selected features were analyzed by the LightGBM machine learning algorithm to construct two models: (1) a radiomics model; and (2) a combined clinical–radiomics model. A clinical model was also developed using the clinical factors selected by the above univariable and multivariable analysis. Discrimination was quantified with the area under the curve (AUC). Prediction accuracy was evaluated by calibration curves, the confusion matrix and the Hosmer–Lemeshow test [18]. The net reclassification index and integrated discrimination improvement were used to evaluate the incremental value of the combined clinical–radiomics model. The LightGBM model was interpreted using SHapley Additive exPlanations (SHAP) to visualize the relationship between variables and ER. Patients in each cohort were stratified according to the risk predicted by the combined radiomics model and compared by Kaplan–Meier analysis. Additionally, we tried several different machine learning classifiers and found that LightGBM performed the best (Supplementary Table S2). Finally, we compared the combined clinical–radiomics model with the AJCC 8th TNM system in predicting ER. The TNM staging model was evaluated using a classical logistic regression model with the staging grades as variables.
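For readers unfamiliar with the discrimination metric, the AUC equals the probability that a randomly chosen patient with ER receives a higher predicted risk than a randomly chosen patient without ER. A minimal sketch of that definition follows; the labels and scores are hypothetical and the function is illustrative rather than the code used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (ER, non-ER) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted ER probabilities for six patients.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(round(roc_auc(labels, scores), 3))
```

Here eight of the nine ER/non-ER pairs are ordered correctly, so the AUC is 8/9.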

Follow-up

Patients were followed up postoperatively at each center per the guideline recommendations [19,20,21,22]. Clinical assessment and CT or MRI examinations were performed every 3 months during the first year after surgery, and at 6-month intervals thereafter. Considering that > 40% of ICC patients may experience recurrence in the first year after surgery [5], ER in this study was defined as recurrence within 1 year after surgery. Locations of recurrence were classified as intrahepatic only, extrahepatic only, or concurrent intra- and extrahepatic for the purposes of analysis. Recurrence-free survival (RFS) was defined as the interval between curative-intent surgery and the date of tumor recurrence or the last follow-up date (without recurrence).
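The outcome definitions above can be expressed as a short sketch. The 365-day window is an assumed operationalization of "within 1 year", and the function names and dates are hypothetical; patients whose 1-year status cannot be determined are returned as unknown, mirroring exclusion criterion (4).

```python
from datetime import date

def early_recurrence(surgery, recurrence, last_follow_up, window_days=365):
    """Return True/False for ER, or None when the 1-year status is unknown."""
    if recurrence is not None:
        return (recurrence - surgery).days <= window_days
    if (last_follow_up - surgery).days >= window_days:
        return False          # followed for a full year without recurrence
    return None               # lost to follow-up before 1 year

def rfs_days(surgery, recurrence, last_follow_up):
    """Recurrence-free survival: surgery to recurrence, else to last follow-up."""
    end = recurrence if recurrence is not None else last_follow_up
    return (end - surgery).days

# Hypothetical patient: surgery in January, recurrence found in September.
surgery = date(2019, 1, 10)
print(early_recurrence(surgery, date(2019, 9, 5), date(2019, 9, 5)))
print(rfs_days(surgery, date(2019, 9, 5), date(2019, 9, 5)))
```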

Statistical analysis

Statistical analyses were performed using SPSS (version 26), R software (version 3.6.2) and Python (version 3.8). Normally distributed data are expressed as means ± SD and were analyzed by one-way ANOVA. Non-normally distributed data are described as medians [interquartile ranges, IQR] and were analyzed by the Kruskal–Wallis H test. Comparison between different ROC curves was performed by the DeLong test [23]. mRMR analysis was performed using the ‘mRMRe’ package of R software. The rest of the feature selection and model construction were performed using the Python-based ‘scikit-learn’ machine learning framework [24]. ROC curves, confusion matrices and radar plots were plotted with the Python-based 'Matplotlib' library. For all tests, p < 0.05 was considered statistically significant.

Results

Patient characteristics

A total of 311 ICC patients from 8 hospitals across southern and eastern China were included in this study. The demographic and baseline clinical characteristics of all patients are summarized in Table 1. Cholelithiasis was more common in Cohort 3 (65.6% (40/61)), accompanied by relatively high bilirubin levels. The remaining clinical features did not differ significantly among centers. In addition, because data on liver flukes and primary sclerosing cholangitis were missing in more than half of the retrospective cases, these variables were not included in further analysis. Most patients (87.1% (271/311)) received R0 resection, with no significant differences between cohorts. After a median follow-up time of 50 weeks [IQR 28–106 weeks], the median RFS time was approximately 45–50 weeks, and there was no difference in the ER rates (ranging from 47.2 to 54.1%) among the four cohorts.

Table 1 Baseline characteristics of ICC patients in the derivation cohort and 3 validation cohorts (n = 311)

Feature selection and development of radiomics model

Of the 6296 radiomic features, 528 were deemed unreliable (intra- or inter-observer intraclass correlation coefficient < 0.75) and excluded. The remaining features were ranked by the mRMR algorithm and the top 100 were retained (Supplementary Fig S2A). The final 15 radiomic features were obtained after recursive feature elimination with cross-validation (Supplementary Fig S2B). The radiomics model was developed using the tree-based LightGBM algorithm and showed AUCs of 0.877 (95% CI 0.816–0.938), 0.780 (95% CI 0.627–0.934), 0.751 (95% CI 0.636–0.866) and 0.739 (95% CI 0.611–0.866) in the derivation cohort, internal and two external validation cohorts, respectively.

Development of combined radiomics model

Univariable and multivariable analyses showed that CA19-9 (> 1000 U/ml), vascular invasion and tumor margin were independently associated with ER (Supplementary Fig S3). Their variance inflation factors (1.8, 1.7 and 1.5, respectively) were all < 10, suggesting no multicollinearity between the variables. A combined radiomics model was developed by incorporating these three clinical features into the radiomics model using the LightGBM algorithm (see Supplementary Materials for detailed parameters). The categorical net reclassification index and integrated discrimination improvement of the combined radiomics model showed improvement in all four cohorts with the addition of radiomics features (Supplementary Table S3). SHAP enabled an interpretative analysis of the combined radiomics model by visualizing the specific impact of each variable on ER prediction (Fig. 2). Furthermore, each patient can be analyzed individually (Supplementary Fig S4).
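The two incremental-value metrics named above can be illustrated with a minimal sketch. The single risk cutoff of 0.5 for the categorical net reclassification index is an assumption (the source does not state the categories used), and the probabilities are hypothetical.

```python
def idi(labels, p_old, p_new):
    """Integrated discrimination improvement of a new model over an old one."""
    def mean(values):
        return sum(values) / len(values)
    events = [i for i, y in enumerate(labels) if y == 1]
    nonevents = [i for i, y in enumerate(labels) if y == 0]
    gain_events = mean([p_new[i] for i in events]) - mean([p_old[i] for i in events])
    gain_nonevents = mean([p_new[i] for i in nonevents]) - mean([p_old[i] for i in nonevents])
    return gain_events - gain_nonevents

def categorical_nri(labels, p_old, p_new, cutoff=0.5):
    """Two-category NRI: net correct reclassification in events plus non-events."""
    up_e = down_e = up_n = down_n = 0
    n_events = sum(labels)
    n_nonevents = len(labels) - n_events
    for y, old, new in zip(labels, p_old, p_new):
        moved_up = old < cutoff <= new
        moved_down = new < cutoff <= old
        if y == 1:
            up_e += moved_up
            down_e += moved_down
        else:
            up_n += moved_up
            down_n += moved_down
    return (up_e - down_e) / n_events + (down_n - up_n) / n_nonevents

# Hypothetical predicted ER probabilities from a reference and a new model.
labels = [1, 1, 1, 0, 0, 0]
p_reference = [0.6, 0.4, 0.3, 0.6, 0.4, 0.2]
p_new_model = [0.8, 0.7, 0.4, 0.3, 0.4, 0.1]
print(round(idi(labels, p_reference, p_new_model), 3))
print(round(categorical_nri(labels, p_reference, p_new_model), 3))
```

Positive values of both metrics indicate that the new model assigns higher risks to patients with ER and lower risks to patients without ER than the reference model does.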

Fig. 2

Contribution analysis of each feature using SHapley Additive exPlanations (SHAP) in the LightGBM model. The features are ranked according to the sum of the SHAP values for all patients. A positive/negative SHAP value indicates a positive/negative correlation between a variable and the outcome
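To make the meaning of a SHAP value concrete, the sketch below computes exact Shapley values by brute force for a toy risk function with three binary clinical variables. The toy model, its coefficients, the variable names and the all-zero baseline are hypothetical; SHAP for tree models such as LightGBM uses a dedicated algorithm and a background dataset rather than this enumeration.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Average marginal contribution of each feature over all feature orderings."""
    names = list(instance)
    orderings = list(permutations(names))
    contributions = {name: 0.0 for name in names}
    for order in orderings:
        current = dict(baseline)          # start from the reference patient
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature to the patient's value
            value = model(current)
            contributions[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in contributions.items()}

def toy_risk(x):
    """Hypothetical ER risk score with an interaction term (not the study model)."""
    return (0.2 + 0.3 * x["ca199_high"] + 0.2 * x["vascular_invasion"]
            + 0.1 * x["ca199_high"] * x["vascular_invasion"]
            + 0.15 * x["irregular_margin"])

patient = {"ca199_high": 1, "vascular_invasion": 1, "irregular_margin": 0}
reference = {"ca199_high": 0, "vascular_invasion": 0, "irregular_margin": 0}
values = shapley_values(toy_risk, patient, reference)
print({name: round(v, 3) for name, v in values.items()})
```

The values sum to the difference between the patient's predicted risk and the reference risk, which is the additivity property that lets each variable's positive or negative contribution be read directly from the plot.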

Internal and external validation of combined radiomics model

As shown by the ROC curves (Fig. 3A–D), the combined radiomics model yielded the highest AUCs of 0.974 (95% CI 0.946–1.000), 0.882 (95% CI 0.672–0.962), 0.871 (95% CI 0.786–0.955) and 0.878 (95% CI 0.797–0.960) in the derivation cohort, internal and two external validation cohorts, respectively. The calibration curves showed good agreement in all four cohorts (Fig. 4A–D), and the Hosmer–Lemeshow test suggested no significant differences between the predicted and observed recurrence probabilities (p = 0.988, 0.541, 0.241 and 0.330), supporting the accuracy of this model. The confusion matrix for each model is shown in Supplementary Fig S5.
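The Hosmer–Lemeshow test used above compares observed and expected event counts within groups of patients ordered by predicted risk. A minimal sketch follows; the number of groups, the equal-size grouping and the data are assumptions, and the p-value helper uses the closed form of the chi-square survival function that holds only for an even number of degrees of freedom (groups minus 2).

```python
import math

def hosmer_lemeshow(labels, probs, groups=10):
    """Hosmer-Lemeshow chi-square statistic over equal-size risk groups."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(label for _, label in chunk)
        expected = sum(prob for prob, _ in chunk)
        statistic += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return statistic

def chi2_sf_even_df(x, df):
    """Chi-square survival function; valid only for even df >= 2."""
    if df < 2 or df % 2:
        raise ValueError("closed form requires an even df of at least 2")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

# Hypothetical predicted probabilities and observed ER status for eight patients.
probs = [0.1, 0.1, 0.3, 0.3, 0.6, 0.6, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
stat = hosmer_lemeshow(labels, probs, groups=4)
p_value = chi2_sf_even_df(stat, df=4 - 2)
print(round(stat, 3), round(p_value, 3))
```

A large p-value, as reported for all four cohorts, means the test found no evidence of a mismatch between predicted and observed recurrence rates.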

Fig. 3

Comparison of ROC curves between different models for predicting the risk of early recurrence after curative resection of ICC. ICC intrahepatic cholangiocarcinoma, ROC receiver operating characteristic

Fig. 4

Calibration curves in each cohort. The dotted diagonal line represents perfect calibration. The solid line is the performance of the current model, and the vertical bars represent confidence intervals. The closer the solid line is to the dotted line, the better the goodness of fit of the model

In addition, pairwise comparisons of the ROC curves among the three validation cohorts all showed p values greater than 0.05 (Supplementary Table S4), which also suggested good stability and reproducibility of the combined radiomics model.

Patients in each cohort were preoperatively stratified according to the risk predicted by the combined radiomics model, and Kaplan–Meier analysis suggested good discrimination in every cohort (Supplementary Fig S6).

Then, we compared the combined radiomics model with the AJCC 8th TNM system, showing that the combined radiomics model performed better in both internal and external validation cohorts (Table 2 and Fig. 5). In particular, the combined radiomics model had a significantly higher sensitivity, reaching 94.6% on average across all cohorts. Comparison of the ROC curves of the combined radiomics model and the AJCC 8th TNM staging by the DeLong test showed significant differences (p < 0.05) in all cohorts (Supplementary Table S5).
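Sensitivity and the other performance measures compared in Table 2 are derived from the confusion matrix. The sketch below shows the calculation on hypothetical labels and binary predictions; it does not reproduce the study's counts.

```python
def diagnostic_metrics(labels, predictions):
    """Sensitivity, specificity and accuracy from binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),   # share of ER patients flagged as high risk
        "specificity": tn / (tn + fp),   # share of non-ER patients flagged as low risk
        "accuracy": (tp + tn) / len(labels),
    }

# Hypothetical example: 4 patients with ER, 6 without.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = diagnostic_metrics(labels, predictions)
print({name: round(value, 3) for name, value in metrics.items()})
```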

Table 2 Comparison of prediction performance between Combined radiomics model and AJCC 8th TNM staging
Fig. 5

The radar chart visualization of prediction performance for Combined radiomics model and AJCC 8th TNM staging

In terms of practical applications, as shown in Fig. 6, the model helps clinicians to identify preoperatively a subset of patients who are at high risk of ER and to tailor treatment and follow-up plans accordingly, such as early use of neoadjuvant chemotherapy or a shorter follow-up interval, which may help reduce the risk of ER and improve the prognosis of patients.

Fig. 6

Two representative contrast-enhanced CT image examples of ICC. A: CT image of a 48-year-old man in the arterial phase shows a 7.9 × 4.8 cm nodular lesion with arterial peri-tumoral enhancement (arrow) in hepatic segment IV. Postoperative pathology showed positive vascular invasion, and the TNM stage of this patient was II. Using the combined radiomics model, this patient was stratified as low risk and he survived without recurrence until the last follow-up (92 weeks). B: CT image of a 74-year-old woman in the arterial phase shows a 4.3 × 2.6 cm nodular lesion without obvious arterial enhancement in hepatic segment V/VIII. Postoperative pathology showed negative vascular invasion, and her TNM stage was Ia. However, using the combined radiomics model, this patient was stratified as high risk and the tumor recurred 34 weeks after surgery. ICC intrahepatic cholangiocarcinoma, RFS recurrence-free survival

Discussion

In this multicenter study, we established and evaluated a combined radiomics model based on the LightGBM machine learning algorithm for predicting ER after curative resection of ICC. Using this model, a total of 149 patients across all cohorts were identified as being at high risk of recurrence, accounting for 94.6% of the patients who actually developed ER, and only 5.1% of patients were missed by the model. This is of great importance for ICC, a highly malignant tumor.

Several radiomics studies on predicting the prognosis of ICC have been reported. Liang et al. [15] developed a nomogram based on preoperative MR images to predict ER of ICC in a single-center study, with AUCs of 0.90 and 0.85 in the training and validation groups. In the present study, the AUCs reached 0.97 and 0.88 in the training and internal validation cohorts and remained above 0.87 in two independent external validation cohorts, indicating that CT- and MR-based radiomics models can achieve comparable predictive efficacy, while CT has advantages in cost and time. In another preoperative CT-based study, Park et al. [25] segmented the tumor on a single transverse section and developed a model by LASSO-Cox regression, yielding C-indices of 0.75 and 0.71 in the training and validation groups, which were not significantly different from that of the existing TNM staging system (C-index: 0.73). In our study, the combined radiomics model outperformed the TNM system in both internal and external validation, highlighting two factors that may improve ER prediction: one is 3D VOI segmentation with inclusion of the peri-tumoral area, and the other is the use of the LightGBM machine learning algorithm to construct the model. Recently, Zhu et al. also predicted early recurrence of ICC using a CT radiomics approach, with AUCs of 0.844 (95% CI 0.751–0.912) and 0.793 (95% CI 0.617–0.914) in the training and validation groups [16]. Although that model performed well, its AUCs were still slightly lower than those of our model. In addition, the study by Zhu et al. was based solely on the transverse section of maximum tumor diameter for segmentation, and its relatively small number of cases (only 125) should also be noted.

The model in this study comprised 15 radiomic features and 3 clinical features. It is worth noting that 93.3% (14/15) of the radiomic features were 3D and wavelet features, which is consistent with the results of previous studies [26, 27] and indicates that tumor heterogeneity can be reflected as differences in the spatial distribution of voxel intensities in images, one of the assumptions of radiomics [28]. After univariable and multivariable regression analyses, three clinical variables were finally selected: CA19-9 (> 1000 U/ml), tumor vascular invasion and tumor margin. The latter two variables are consistent with several previous studies, in which they were considered independent risk factors for postoperative recurrence [29]. Whether CA19-9 can serve as an independent risk factor for the prediction of recurrence is controversial. Park et al. [25] and Jeong et al. [9] concluded that tumor markers including CA19-9 and CEA were not significantly associated with postoperative recurrence, while the studies of Wang et al. [29] and Ji et al. [30] found that CA19-9 (> 1000 U/ml) was significantly associated with RFS (p < 0.05). In addition, Roongruedee et al. [31] found that CA19-9 > 1000 U/ml was the optimal threshold for predicting prognosis after resection by comparing different thresholds of CA19-9 (≥ 100, ≥ 500, ≥ 1000 U/ml), and these findings are consistent with the present study. Finally, it is of interest that although tumor size was associated with ER in univariable analysis, it was not statistically significant in multivariable analysis (p = 0.146). In AJCC 8th staging, the T1 stage is divided into T1a and T1b using a tumor diameter of > 5 cm as the cut-off value, and several studies have also suggested that tumor diameter > 5 cm is an independent risk factor for tumor recurrence and survival [32, 33]. However, the aforementioned studies mainly focused on patients with solitary ICC without vascular invasion, i.e., in stage T1, whereas in our study the staging of the included ICC patients varied and was not limited to stage T1, suggesting that tumor diameter may not be a major factor associated with ER when all stages of ICC are covered.

In the real world, variables may interact in both linear and non-linear fashions. Traditional linear models used in previous studies often yielded poor performance in real-world testing, whereas machine learning can capture complex linear and non-linear relationships between variables. In this study, the LightGBM algorithm was selected for its superior performance after comparing different algorithms. Introduced in 2017, LightGBM is an algorithm based on ensemble learning. In addition to its high accuracy, LightGBM has the advantages of supporting distributed learning, faster training speed and lower memory usage. It has won Kaggle data science competitions several times and has been applied in practice in the finance and power industries [34, 35]. In this study, LightGBM was efficient: the average training time was < 30 s even after 100,000 rounds of cross-validation, which greatly enhances the possibility of practical clinical application in the future.

Derived from the gradient boosting decision tree (GBDT), LightGBM has better predictive accuracy than traditional linear models, but it loses the interpretability of linear models and is usually considered a black box model. In 2017, Lundberg and Lee [36] proposed SHAP values, a unified approach to interpreting model predictions, thus making it possible to interpret black box models such as LightGBM and neural networks. This approach not only allows us to understand the intricate relationships between the predictors and the target variables, but also helps to explore in depth the relationship between the large number of mathematically transformed radiomics features and gene expression in life science.

There are several limitations in this study. First, considering the long appointment time for MR and the convenience of CT, we chose CT images for analysis, but the use of MR, especially MRCP and MR-specific sequences, has also attracted increasing attention in hepatobiliary diseases [37, 38]. Second, we selected non-contrast and arterial phase images for analysis in this study, which theoretically might decrease predictive performance compared with a model based on full-phase CT images. Nevertheless, the predictive model we developed achieved relatively superior performance compared with related studies (AUC = 0.871–0.974), which may be attributed to the relatively large sample size in our study and the machine learning algorithm. In the future, we also plan to perform feature extraction and analysis based on full-phase CT. Third, to ensure the accuracy of image segmentation, we used manual segmentation of the ROI, which inevitably takes more time. With the development of AI technology, the inclusion of automatic segmentation may further improve practical clinical application in the future. Fourth, the present study only evaluated 311 Chinese ICC patients and was unable to evaluate patients of different ethnicities. Further validation of the main findings by international multicenter studies is required.

In conclusion, we used the LightGBM machine learning algorithm to construct a combined radiomics model for ICC ER prediction. The combined radiomics model showed high sensitivity, detecting 94.6% of the ICC patients who actually developed ER across all cohorts, and performed better than the TNM staging system. This model, once further validated, may provide a new clinical tool to improve the clinical management of ICC patients.