Introduction

With the improvement and wide use of imaging examinations, gallbladder (GB) masses have been detected in approximately 5% of the global population, presenting as polypoidal intraluminal masses or space-occupying masses in GB imaging [1, 2]. Non-neoplastic and neoplastic lesions (GB carcinomas and adenomas) constitute the disease spectrum and differ in risk and prognosis [1, 3]. GB carcinomas are highly aggressive, metastasise early, lack specific symptoms at an early stage, and carry a poor prognosis with high mortality [4, 5]. Early identification and adequate surgical intervention for neoplastic lesions are vital for preventing the occurrence or progression of GB carcinomas [6, 7].

Currently, common guidelines recommend cholecystectomy for GB lesions sized over 10 mm [8, 9]. However, numerous non-neoplastic lesions are sized over 10 mm, and follow-up may be an alternative to unnecessary surgical intervention [10]. The size of some GB neoplastic lesions is also below this threshold [9]. Moreover, in terms of management, laparoscopic or simple cholecystectomy is appropriate for benign GB lesions. For GB carcinomas, comprehensive assessment is required to determine a specific treatment strategy, such as radical cholecystectomy or extensive resection [2, 11]. Therefore, accurate preoperative risk discrimination of GB neoplastic from non-neoplastic lesions or GB malignant from benign lesions is a prerequisite for decision-making.

Conventional ultrasound (US) is the first-line imaging modality used for the diagnosis of GB diseases [12, 13]. Several previous studies have shown that some US features are associated with GB neoplastic or malignant lesions; however, the only universally accepted risk factor is the size of the lesion [14,15,16]. Moreover, the risk stratification of GB lesions using only the visual interpretation of US images is challenging [17]. Contrast-enhanced ultrasound (CEUS) is an additional imaging technology for differentiating GB diseases by providing vascularity and perfusion information [18,19,20]. Some studies have reported that perfusion features and GB wall integrity in CEUS are useful in discriminating malignant from benign GB lesions [21,22,23]. However, heavy dependence on operator experience, increased cost, low availability, and potential contraindications to US contrast agents limit the clinical application of CEUS [24].

Recently, radiomics has drawn widespread attention for medical image recognition tasks [25]. Radiomics provides quantitative and comprehensible features extracted through computerised algorithms, which can serve as imaging biomarkers beyond human visual interpretation [26,27,28]. A few studies have applied radiomics approaches to identify true and pseudo-GB polyps or neoplastic GB polyps [29, 30]. Because of their ability to process quantitative image information, machine learning (ML)-based computational methods have been introduced to improve the diagnostic accuracy for disease classification [31].

Combining the extracted quantitative features with ML-based computational methods may be promising for the risk stratification of GB masses. Thus, the aim of our study was to investigate the diagnostic performance of ML-based US radiomics models in the risk discrimination of GB masses from two clinical perspectives: (1) discrimination of neoplastic from non-neoplastic lesions, and (2) discrimination of malignant from benign lesions (both neoplastic and non-neoplastic lesions). Furthermore, we investigated whether ML-based US radiomics models could outperform CEUS in diagnostic performance.

Materials and methods

This multi-institutional study was approved by the ethics committee of the institution (No: SHSYIEC–4.1/21–263/01; 2022-187R), and informed consent was obtained. The prospective protocol is registered at www.chictr.org.cn (ChiCTR2200056165).

Patients

Between August 2019 and October 2022, 609 consecutive patients were enrolled from Institution 1 (Shanghai Tenth People’s Hospital) and Institution 2 (Zhongshan Hospital, Fudan University) as a model development dataset, following the inclusion criteria: (a) patients with non-mobile GB masses as observed during conventional US examination within 1 month before surgery and (b) those who underwent surgical resection and had pathologically confirmed GB masses. The exclusion criteria for patients were as follows: (a) surgically diagnosed with GB stones (n = 11); (b) incomplete clinical data (n = 25); (c) poor quality of US images (n = 30); (d) complete loss of GB structure and inability to distinguish the lesion from adjacent liver tissue, so that the lesion outline could not be delineated manually as a region of interest (ROI) (n = 4); and (e) concurrent carcinomas elsewhere in the digestive system, excluded to avoid their influence on carbohydrate antigen 19-9 (CA19-9) and carcinoembryonic antigen (CEA) levels (n = 3). The largest lesion was selected as the target in cases with multiple lesions. Eventually, 536 patients with 536 GB masses were prospectively enrolled and randomly divided into training and validation sets in a 7:3 ratio.
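The 7:3 random division can be illustrated with a minimal sketch. The function name, the fixed seed, and the integer patient IDs are assumptions made for illustration; the text does not state whether the split was stratified by pathology, so a plain shuffle is shown here.

```python
import random

def split_train_validation(patient_ids, train_fraction=0.7, seed=0):
    """Shuffle patient IDs and split them into training and validation lists."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # assumed fixed seed, only to make the split reproducible
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# Hypothetical integer IDs standing in for the 536 patients of the development dataset.
train_ids, validation_ids = split_train_validation(range(536))
print(len(train_ids), len(validation_ids))
```

With 536 patients, rounding 0.7 × 536 gives 375 training cases and leaves 161 validation cases, which matches the set sizes reported in the Results.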

Following the same criteria used for the model development dataset, 56 consecutive patients with 56 GB masses from Institution 3 (First Affiliated Hospital of Wenzhou Medical University) and 48 consecutive patients with 48 GB masses from Institution 4 (First Hospital of Ningbo University) were prospectively enrolled as two independent external test sets (A and B, respectively) between June 2020 and October 2022.

Among the enrolled 640 patients, 95 underwent additional CEUS examination. Excluding five patients with poor image quality of CEUS, 90 patients with 90 GB masses were enrolled as a “comparison” set to compare the diagnostic performance between the ML-based US radiomics model and the CEUS model. A detailed flowchart of the patient selection process is presented in Fig. 1. The basic demographic and clinicopathological characteristics of the patients were collected from the medical record systems (Table 1).

Fig. 1

Flowchart of patient selection in this study. US, ultrasound; CEUS, contrast-enhanced ultrasound

Table 1 The demographic data of 640 patients from four institutions

Conventional US and CEUS data acquisition

The details of conventional US and CEUS examinations are provided in Supplementary Material S1. All US and CEUS data in the Digital Imaging and Communications in Medicine format were stored in the respective imaging system platform.

US imaging segmentation and radiomics features extraction

The greyscale US imaging segmentation was processed using the ITK-SNAP 3.6.0 programme (http://www.itksnap.org) [32]. The ROI was manually segmented by two radiologists (with 7 and 8 years of experience in abdominal US, respectively), detailed in Supplementary Material S2.

For each ROI, 1070 radiomics features (Supplementary Material S3) were extracted using the IFoundry software (Intelligence Foundry 1.2, GE Healthcare).

Key US radiomics features selection and construction of the ML-based US radiomics models

The intraclass correlation coefficient (ICC > 0.8), t-test, Spearman’s correlation (r < 0.8), and the least absolute shrinkage and selection operator (LASSO) were used to select the key US radiomics features. Eleven ML classifiers (Supplementary Material S4) were used with the selected radiomics features to construct 11 ML-based US radiomics models. Fivefold cross-validation was used to adjust the structural parameters [33]. The ML classifier with the highest area under the curve (AUC) in the validation set was regarded as the optimal ML-based US radiomics model (Fig. 2).
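As an illustration of one step of this pipeline, the following sketch removes redundant features whose pairwise Spearman correlation reaches 0.8. It is not the software used in the study: the greedy keep-first rule, the function names, and the toy feature names and values are all assumptions. The ICC, t-test, and LASSO steps are not shown.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def drop_redundant(features, threshold=0.8):
    """Greedily keep a feature only if |rho| < threshold against every kept feature."""
    kept = []
    for name in features:
        if all(abs(spearman(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy values for three made-up feature names (six lesions each).
features = {
    "glcm_contrast": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "glcm_contrast_sq": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],  # monotone in the first feature
    "shape_elongation": [3.0, 1.0, 6.0, 2.0, 5.0, 4.0],
}
selected = drop_redundant(features)
print(selected)
```

Because the second toy feature is a monotone transform of the first, its rank correlation with it is 1 and it is dropped, while the third feature is retained.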

Fig. 2

Study workflow of the US, CEUS, US R, and US R + C models for risk stratification of gallbladder masses. US, ultrasound; CEUS, contrast-enhanced ultrasound; US R, ultrasound radiomics model; US R + C, ultrasound radiomics incorporated clinical characteristics model; GB, gallbladder; ROI, region of interest

Construction of the ML-based US radiomics incorporated clinical characteristics models

Logistic regression analysis was used to identify independent predictors among the clinical characteristics, including sex and age of the patients, presence of gallstone, and levels of CA19-9 and CEA. The independent clinical predictors were then combined with the optimal US radiomics models to construct the ML-based US radiomics-incorporated clinical characteristics models (Fig. 2).

Conventional US and CEUS data analysis and construction of the conventional US and CEUS models

A radiologist with more than 10 years of experience in abdominal US and CEUS analysed the conventional US images of 640 patients. After a buffer period of 1 week, the CEUS cines of 90 patients were analysed by the same radiologist. The details of data analyses are provided in Supplementary Material S5.

Conventional US and CEUS models for GB mass prediction were constructed using logistic regression analysis with a fivefold cross-validation.
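The partitioning behind fivefold cross-validation can be sketched as follows. This is a generic illustration rather than the study's code: the function name, the seed, and the sample count of 375 (chosen to match the training-set size) are assumptions, and the model fitting inside each fold is omitted.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, held_out) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed seed, for reproducibility only
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out

# Each sample is held out exactly once across the five folds.
splits = list(k_fold_indices(375))
for training, held_out in splits:
    print(len(training), len(held_out))
```

Each of the five iterations fits on four fifths of the samples and evaluates on the remaining fifth, so every sample contributes to evaluation once.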

Statistical analysis

Data analyses were performed using Python (version 3.8.8) and SPSS software (version 22.0; IBM Corporation). Normality was tested using the Kolmogorov–Smirnov test. Continuous variables were compared using the t-test and described as mean and standard deviation. Categorical variables were compared using the chi-square or McNemar’s test. The diagnostic performance was evaluated using AUC. DeLong’s test was used to assess differences between AUCs. Statistical significance was set at a p value of less than 0.05.
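For readers unfamiliar with the AUC, the sketch below computes it directly from its rank-based definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the toy labels and scores are hypothetical; DeLong's test for comparing two AUCs is not shown.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: 1 = malignant, 0 = benign; scores are model outputs.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.4]
auc = roc_auc(labels, scores)
print(auc)
```

In this toy example, 10.5 of the 12 positive/negative pairs are ordered correctly, giving an AUC of 0.875.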

Results

Demographic and clinicopathological characteristics of patients

Eventually, there were 375, 161, 56, and 48 patients in the training set, validation set, and test sets A and B, respectively (Table 1). In the training set, elevated CA19-9 levels, elevated CEA levels, mean age, sex, and presence of gallstone were significantly different between patients with GB carcinomas and those with benign lesions. Elevated CA19-9 levels, elevated CEA levels, mean age, and presence of gallstone were statistically different between patients with neoplastic lesions and those with non-neoplastic GB lesions (all p < 0.05) (Supplementary Material S6).

Construction and validation of the ML-based US radiomics models

To discriminate between neoplastic and non-neoplastic GB lesions, 61 radiomics features in the training set were selected as key features after dimension reduction (Supplementary Material S7). The extreme gradient boosting (XGBoost)-based US radiomics model showed the highest AUC (0.837) among the 11 ML algorithms in the validation set (Supplementary Material S8). In the two external test sets, the model showed AUCs of 0.822 and 0.853 (Table 2). Examples of cases correctly diagnosed by the XGBoost-based US radiomics model are shown in Fig. 3a–c. There were 67, 32, and 24 patients with non-neoplastic GB lesions sized over 10 mm in the validation set and test sets A and B, respectively. Among them, the XGBoost-based US radiomics model correctly identified 65, 31, and 21 non-neoplastic GB lesions, and it potentially reduced the unnecessary cholecystectomy rate significantly in a speculative comparison with the current guidelines (3.1%, 3/97 vs. 53.6%, 52/97 for validation set, p < 0.001; 2.7%, 1/37 vs. 64.9%, 24/37 for test set A, p < 0.001; and 13.8%, 4/29 vs. 62.1%, 18/29 for test set B, p = 0.001).

Table 2 The diagnostic performance of models in discrimination of the GB neoplastic lesions from non-neoplastic ones
Fig. 3

Examples correctly diagnosed by ultrasound radiomics models. a A case of a 48-year-old woman with a cholesterol polyp. b A case of a 76-year-old man with an adenomyomatosis. c A case of a 61-year-old woman with an adenoma. d A case of a 72-year-old woman with a tubular adenoma. e A case of a 63-year-old woman with a gallbladder carcinoma. f A case of a 59-year-old woman with a gallbladder carcinoma. The lesions are indicated by red arrows

To discriminate GB carcinomas from benign lesions, 33 critical US radiomics features were reserved using the same feature reduction and selection methods (Supplementary Material S7). The XGBoost-based US radiomics model showed the highest AUC of 0.904 in comparison to the other 10 ML algorithms in the validation set (Supplementary Material S8). In the two external test sets, the model showed AUCs of 0.909 and 0.979 in test sets A and B, respectively (Table 3). Example cases diagnosed correctly using the XGBoost-based US radiomics model are shown in Fig. 3d–f.

Table 3 The diagnostic performance of models in discrimination of GB carcinomas from benign lesions

Comparison between the conventional US models and the ML-based US radiomics models

The largest diameter (≥ 16.1 mm) was identified as an independent predictor of neoplastic GB lesions in the training set (Supplementary Material S9). The diagnostic performance of the XGBoost-based US radiomics model was superior to that of the conventional US model based on the largest diameter in terms of AUCs (0.822–0.853 vs. 0.642–0.706, all p < 0.05) (Table 2 and Fig. 4a–c).

Fig. 4

The receiver operating characteristic curves of different methods in the validation (a), test A (b), test B (c), and comparison (d) sets. CUS, conventional ultrasound; CEUS, contrast-enhanced ultrasound; US R, ultrasound radiomics model; US R + C, ultrasound radiomics incorporated clinical characteristics model; GBC, gallbladder carcinoma; GB neo-, neoplastic gallbladder lesion

The largest diameter (≥ 21.0 mm) and discontinuity of the GB wall were independent predictors of GB carcinomas (Supplementary Material S9). The diagnostic performance of the XGBoost-based US radiomics model was superior to that of the conventional US model based on the largest diameter and discontinuity of the GB wall in terms of AUCs (0.904–0.979 vs. 0.706–0.766, all p < 0.05) (Table 3 and Fig. 4a–c).

Comparison between ML-based US radiomics-incorporated clinical characteristics models and ML-based US radiomics models

Elevated CA19-9 levels (≥ 27 U/mL) and age (> 59 years) were independent clinical predictors of neoplastic GB lesions in the training set. Likewise, elevated CA19-9 levels (≥ 27 U/mL), presence of gallstone, female sex, and age (> 59 years) were predictors of GB carcinomas (Supplementary Material S10). For the discrimination of both neoplastic GB lesions and GB carcinomas, adding the independent clinical predictors produced no significant difference in AUCs between the US radiomics-incorporated clinical characteristics models and the US radiomics models (all p > 0.05 in the validation set and test sets A and B) (Tables 2 and 3; Fig. 4a–c).

Comparison between the CEUS models and ML-based US radiomics models

Compared with benign lesions, GB carcinomas more often displayed hyper-enhancement at the early phase (94.1% vs. 49.3%), earlier washout (33.1 ± 7.6 s vs. 47.2 ± 14.0 s), and discontinuity of the GB wall (64.7% vs. 4.1%) on CEUS (all p < 0.05) (Supplementary Material S11). The optimal cutoff value for washout time was 38 s. These three features were identified as independent characteristics of GB carcinomas (Supplementary Material S12). When discontinuity of the GB wall or washout time within 38 s was considered (Diagnostic criterion 1), AUC was 0.847. When any of the two characteristics appeared (Diagnostic criterion 2), AUC increased to 0.902 (Table 4).
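The text does not state how the 38 s washout cutoff was derived. One common approach, shown here purely as an assumption, is to choose the threshold that maximises the Youden index J = sensitivity + specificity − 1. The function name is hypothetical, and the toy washout times below are not study data; they were constructed so that the selected cutoff happens to coincide with the reported value.

```python
def youden_cutoff(times, is_carcinoma):
    """Return (cutoff, J) maximising J for the rule 'washout time <= cutoff means carcinoma'."""
    n_pos = sum(is_carcinoma)
    n_neg = len(is_carcinoma) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(times)):
        # True positives: carcinomas that wash out at or before the cutoff.
        tp = sum(1 for t, y in zip(times, is_carcinoma) if y == 1 and t <= cutoff)
        # True negatives: benign lesions that wash out after the cutoff.
        tn = sum(1 for t, y in zip(times, is_carcinoma) if y == 0 and t > cutoff)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j

# Hypothetical washout times in seconds: six carcinomas followed by six benign lesions.
times = [25, 30, 33, 36, 38, 45, 35, 40, 44, 50, 55, 62]
is_carcinoma = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cutoff, j = youden_cutoff(times, is_carcinoma)
print(cutoff, round(j, 3))
```

In this toy set, a cutoff of 38 s classifies five of six carcinomas and five of six benign lesions correctly, so J is about 0.667.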

Table 4 The diagnostic performance of CEUS models in discrimination of GB carcinomas from benign lesions in the "comparison" set

In comparison with the Diagnostic criterion 2 on CEUS, the AUC of the XGBoost-based US radiomics model was higher for GB carcinomas prediction (0.995 vs. 0.902, p = 0.011) in the “comparison” set. No significant difference in AUCs was observed between the XGBoost-based US radiomics-incorporated clinical characteristics model and XGBoost-based US radiomics model (AUC: 0.998 vs. 0.995, p = 0.384) (Table 5 and Fig. 4d). Example cases that had atypical manifestations on CEUS but were diagnosed correctly by the XGBoost-based US radiomics model are shown in Fig. 5.

Table 5 The diagnostic performance of models in discrimination of GB carcinomas from benign lesions in the "comparison" set
Fig. 5

Examples of cases with atypical manifestation on CEUS diagnosed correctly by the ultrasound radiomics method. a–c A case of a 72-year-old woman with a tubular adenoma. a US exhibits a hyperechoic lesion of 31.0 mm (red arrow). b CEUS exhibits hyper-enhancement (red arrow) at 19 s after contrast agent injection and local discontinuity of the gallbladder wall (white arrow). c The lesion turns into hypo-enhancement (red arrow) at 36 s after contrast agent injection. d–f A case of a 70-year-old woman with a gallbladder carcinoma. d US exhibits a hyperechoic lesion of 15.0 mm (red arrow). e CEUS exhibits hyper-enhancement (red arrow) at 17 s after contrast agent injection. f The lesion turns into hypo-enhancement (red arrow) at 68 s after contrast agent injection and is misdiagnosed as adenoma by CEUS

Discussion

Our multi-institutional and prospective study indicated that the diagnostic performance of the XGBoost-based US radiomics models was significantly superior to that of conventional US models in terms of AUC for the diagnosis of neoplastic GB lesions and GB carcinomas. Meanwhile, the XGBoost-based US radiomics model could potentially reduce the unnecessary cholecystectomy rate in a speculative comparison with the current consensus guidelines (cholecystectomy for lesions sized over 10 mm) for non-neoplastic GB lesions sized over 10 mm. Furthermore, the XGBoost-based US radiomics model outperformed the CEUS model in discriminating GB carcinomas from benign lesions.

GB masses are common clinically, and their diagnosis can be challenging because of conflicting and unclear imaging manifestations [34]. The 2017 European Joint Guidelines recommend cholecystectomy for GB polypoid lesions sized over 10 mm [35]. The threshold setting of the guideline is based on the limitation of risk stratification among these GB lesions using routine imaging modalities to minimise the omission of neoplastic lesions. Nevertheless, this strategy also leads to unnecessary cholecystectomy, which results in potential post-cholecystectomy syndrome and increases the financial burden on patients [10, 36,37,38]. On the other hand, precise and timely diagnosis of malignant and benign GB lesions is a topic of great interest for evaluating the best treatment options and prognosis assessment [11, 39, 40]. Therefore, devising methods to minimise the unnecessary cholecystectomy rate and reduce the missed diagnosis of GB carcinoma as much as possible promises to be of great interest in clinical practice. To the best of our knowledge, our multi-institutional study is the first to develop ML-based US radiomics models for the risk stratification of GB masses from two clinical perspectives: diagnosis of neoplastic lesions and GB carcinomas.

Radiomics approaches, which continue to evolve and improve, comprise a series of processes: quantitative image analysis to extract features, computerised algorithms to select features, and finally the use of the selected features for diagnosis or prediction [41,42,43]. We selected interpretable ML classifiers that can effectively exploit the obtainable data. In comparison with the other 10 ML classifiers, our study showed that the XGBoost classifier had the highest diagnostic performance for discriminating both neoplastic GB lesions and GB carcinomas. XGBoost is an ML algorithm that improves the gradient boosting machine framework with system optimisations and algorithm enhancements and is outstanding in parallel computing, missing value processing, overfitting control, and prediction generalisation [44].

Previous studies have reported that size of the lesion (14–16 mm) is a main risk factor for neoplastic GB lesions [15, 17, 45]. The suggested risk factors of GB carcinomas include size of the lesion (20–22 mm), sessile shape, and discontinuity of the GB wall [9, 17, 46]. Similarly, our study found that the largest diameter (≥ 16.1 mm) was a significant independent predictor of neoplastic GB lesions, and the largest diameter (≥ 21.0 mm) and discontinuity of the GB wall were independent predictors of GB carcinomas. However, with respect to GB carcinoma discrimination, the accuracy of US is moderate (approximately 70%) [20]. In our study, the accuracy was increased to 91.1–95.8% when the XGBoost-based US radiomics model was used. Choi et al reported that the accuracy for predicting a neoplastic polyp was 66.9–72.1% in high-resolution US [17]. Our XGBoost-based US radiomics model was able to improve the accuracy to 83.2–85.7%.

As a complement to conventional US, CEUS can significantly improve the AUC (approximately 0.90) in discriminating GB carcinomas [20, 47]. Generally, destruction of the GB wall, irregular vascularity, heterogeneous enhancement, and earlier contrast agent wash-out time (28–40 s) are characteristics of GB carcinoma on CEUS [21, 22, 48, 49]. However, certain problems can affect the clinical application of CEUS. It requires venous access, and some patients may have contraindications or allergies to contrast agents. Moreover, its heavy dependence on operator experience and the requirement for high-end US machines usually restrict CEUS to hospitals with appropriate expertise and resources [24]. For ML-based radiomics, the digital US images fed to the proposed models have already been routinely acquired without extra effort. This cost-effective approach would help radiologists differentiate benign GB lesions from malignant ones, leading to better utilisation of healthcare resources.

Yuan et al analysed spatial and morphological features and constructed an SVM-based US radiomics model to distinguish true and pseudo-GB polyps with an AUC of 0.898 [30]. Our study extracted a wider variety of US radiomics features and utilised different ML algorithms to establish optimal ML-based US radiomics models for more effective risk stratification of GB masses. Several studies have suggested that female sex, older age, presence of gallstone, and elevated CA19-9 and CEA levels are potential risk factors for GB carcinomas [40, 50,51,52]. In this study, after adding independent clinical predictors, the incorporated US radiomics models did not show significantly superior diagnostic performance compared with the US radiomics model alone. A possible reason for this is that these clinical characteristics might be of limited value in the risk prediction of GB masses, in contrast to US radiomics features.

This study had several limitations. First, the sample size was relatively limited. However, the models were developed on a multi-centre dataset and tested prospectively in two external test sets involving different US scanners, which supports the reliability of the results. Second, since surgery-confirmed pathology was used as the gold standard, GB lesions sized less than 6 mm, for which cholecystectomy is commonly not recommended, were not included [34]. Third, GB lesions which could not be demarcated from the adjacent liver tissue were excluded. These cases lost the GB structure, and the ROI could not be delineated manually. Meanwhile, the study focused on GB masses; other GB diseases presenting only as thickening of the GB wall were not included. Lastly, our proposed models were not compared with current treatment guidelines in real clinical practice; a randomised clinical trial is needed to validate our results.

In conclusion, our proposed ML-based US radiomics models are capable of preoperatively predicting the risk of GB masses and have the potential to decrease the unnecessary cholecystectomy rate and to substitute for CEUS.