Background

Lung cancer is the second most common cancer and one of the deadliest worldwide [1], and non-small cell lung cancer (NSCLC) accounts for around 85% of cases [2]. Accurate staging plays a crucial role in guiding treatment strategies and predicting prognosis in the management of NSCLC [3]. Nevertheless, staging NSCLC remains challenging because conventional imaging modalities do not fully meet clinical requirements [4].

Low-dose computed tomography (LDCT) is the most commonly recommended modality for lung cancer screening and has been shown to significantly reduce mortality from lung cancer [5]. Nevertheless, LDCT has limited diagnostic performance in identifying tumor metastasis. To complement LDCT, additional examinations, such as 18F-FDG PET/CT or endobronchial ultrasound-guided transbronchial needle aspiration, are recommended [6]. Among these, 18F-FDG PET/CT is the preferred modality for diagnosing distal metastasis (DM) in NSCLC owing to its high accuracy [7]. However, it is costly and time-consuming because it requires a whole-body scan. This creates a clinical demand for convenient, economical, and reliable non-invasive imaging parameters that can improve preliminary screening for DM in NSCLC patients [8].

Recently, there has been a growing interest in utilizing radiomics and deep learning (DL) techniques to analyze medical images. These methods have demonstrated their ability to learn and decipher the representative radiologic phenotypes of tumors [9,10,11,12]. Notably, previous studies have highlighted their success in predicting lymph node metastasis in various cancers, including lung [13, 14], breast [15], gastric [16], and thyroid [17]. Despite these achievements, both radiomics and DL methods have shown limited efficacy in predicting DM in patients with NSCLC [18]. Moreover, the learning pattern and mechanism underlying the prediction of these methods remain unclear. Unraveling these mechanisms could contribute to enhancing the practical applicability of artificial intelligence in real clinical settings.

The aim of this study was to develop and validate LDCT-based radiomics and DL signatures to improve preliminary screening for DM in patients with NSCLC. Furthermore, we sought to investigate the learning patterns and mechanisms underlying these prediction methods.

Materials and methods

Study population

This study was approved by our institutional review board. Written informed consent was waived for patients in the retrospective cohort and was obtained from patients in the prospective cohort. Patients who underwent LDCT and 18F-FDG PET/CT from November 2017 to July 2020 were retrospectively recruited from the Fifth Affiliated Hospital of Sun Yat-Sen University. The inclusion criteria were as follows: (1) pathologically confirmed primary NSCLC; (2) a single lesion; (3) no history of other cancers; (4) an interval between the CT examination and the 18F-FDG PET/CT scan within 2 weeks. The exclusion criteria were as follows: (a) puncture biopsy, chemotherapy, or radiotherapy before PET/CT scanning; (b) unsatisfactory image quality; (c) inability to delineate the lesion on CT; (d) incomplete clinicopathologic data. Stratified random sampling was used to allocate the study population to the development cohort (n = 337) and the internal validation cohort (n = 44) at a ratio of 7:3 based on the case group. Within the internal validation cohort, the number of controls was matched to the number of cases to balance the data.
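The allocation described above can be sketched as follows. This is only one possible reading of the text: it assumes that 30% of the cases are drawn into the internal validation cohort and that an equal number of controls is then drawn to balance it. The function name, the patient identifiers, and the seed are hypothetical and are not taken from the study.

```python
import random

def split_cohort(case_ids, control_ids, val_fraction=0.3, seed=0):
    """Split patients into development and internal validation cohorts.

    Assumed reading of the text: a fraction of cases goes to validation,
    and the same number of controls is drawn so that validation is balanced.
    """
    rng = random.Random(seed)
    cases = list(case_ids)
    controls = list(control_ids)
    rng.shuffle(cases)
    rng.shuffle(controls)

    n_val_cases = round(len(cases) * val_fraction)
    n_val_controls = min(n_val_cases, len(controls))

    validation = cases[:n_val_cases] + controls[:n_val_controls]
    development = cases[n_val_cases:] + controls[n_val_controls:]
    return development, validation

# Hypothetical patient identifiers, for illustration only.
cases = [f"case_{i}" for i in range(20)]
controls = [f"ctrl_{i}" for i in range(80)]
dev, val = split_cohort(cases, controls)
print(len(dev), len(val))
```

Under this reading, the validation cohort is balanced by construction while the development cohort keeps the remaining, imbalanced case mix.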

Following the same inclusion and exclusion criteria, eligible patients at the Fifth Affiliated Hospital of Sun Yat-Sen University from August 2020 to October 2022 were prospectively enrolled to form a prospective validation cohort (n = 114). Additionally, we recruited patients from the Jiangmen Central Hospital from January 2020 to July 2022 to create an external validation cohort (n = 179). The flowchart illustrating the patient recruitment process is presented in Fig. 1.

Fig. 1

Flowchart of the process of patient enrollment and grouping. Institution 1, the Fifth Affiliated Hospital of Sun Yat-Sen University; Institution 2, Jiangmen Central Hospital

Clinicopathological characteristics, such as age, gender, smoking history, pathologic type, TNM staging, and 18F-FDG PET/CT parameters, were extracted from the medical records. The status of DM was confirmed based on the diagnostic report of the 18F-FDG PET/CT examination.

Acquisition of CT images and interpretation of radiologic signs

The imaging process and acquisition parameters of the CT scanner and PET scanner are detailed in the Supplementary Information. Radiologic signs were assessed manually, including tumor size and location, pleural tag, pleural lesions, air bronchogram, calcification, cavitation, well-defined margin, lobulation, spiculation, vessel convergence, and vascular involvement. Detailed definitions of these CT radiologic signs and the interpretive process can be found in the Supplementary Information. For continuous variables, the average of the readers' measurements was taken as the final value; for categorical variables, any discrepancies were resolved by consensus through discussion. Cohen's Kappa coefficient or the intraclass correlation coefficient was calculated for each radiologic sign, and the distribution of interobserver agreement is presented in Table S1.

Signature development and interpretation

Machine-learning (ML) algorithms were used to develop the radiomics signature, while the DL signature was developed using neural architecture search (NAS) [19]. The 3D region of interest of the primary tumor was delineated on LDCT using ITK-SNAP software (Version 3.8.0); the procedure is detailed in the Supplementary Information. Figure 2 provides an overview of the development workflows.

Fig. 2

Workflow of the radiomics and deep learning signatures building process. For radiomics signature, a total of 1688 features were extracted and the top 20 features were further selected by the random forest regressor. Following parameter adjustments, seven machine learning models were trained to create an ensemble prediction. Regarding the deep learning signature, a 3D network architecture search was conducted. The number of shifted frameworks ranged from 3 to 9 (3 ≤ E + F + G ≤ 9). The top 10 architectures that exhibited an excellent trade-off between performance and speed were selected and trained to make an ensemble prediction. As for the evaluation and interpretation, receiver operating characteristic (ROC) curves were used and the features selected to construct the radiomics and deep learning signatures were visualized

For the development of radiomics signature, radiomics features were extracted from the 3D region of interest of tumors using PyRadiomics [20] and further selected by the random forest regressor. A total of 7 ML algorithms (Logistic Regression, BernoulliNB, KNeighborsClassifier, RandomForestClassifier, XGBClassifier, DecisionTreeClassifier, SVM) were trained to constitute the radiomics signature. To interpret the reasoning process, we used radiomics voxel mapping to reflect the contribution of each voxel to the calculation of a certain radiomics feature. The whole development process is detailed in Supplementary Information.

Based on NAS, we selected and trained the top 10 DL architectures to construct the DL signature. To track the attention weight of each voxel in the reasoning process, convolutional block attention modules [21] were added. The detailed development process is described in Supplementary Information.

For both signatures, an ensemble strategy was used. The radiomics signature was determined by averaging the predicted probabilities generated by the seven ML algorithms, while the DL signature was determined by averaging the predicted probabilities generated by the 10 candidate DL classifiers. To verify the ensemble strategy, n models were randomly picked to perform an ensemble prediction. For the radiomics signature, n ranged from 1 to 7, and for the DL signature, n ranged from 1 to 10. The process was repeated 10 times, and the average performance indices were calculated.
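The ensemble averaging and the random-subset check described above can be sketched as follows. This is a minimal illustration on synthetic probabilities, not the study's implementation: the helper names, the simulated model outputs, and the seeds are assumptions, and AUC is used here as the single performance index.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ensemble_mean(prob_lists):
    """Average predicted probabilities across models, patient by patient."""
    return [sum(ps) / len(ps) for ps in zip(*prob_lists)]

def subset_ensemble_auc(labels, model_probs, n, repeats=10, seed=0):
    """Mean AUC of ensembles built from n randomly chosen models."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        chosen = rng.sample(model_probs, n)
        aucs.append(auc(labels, ensemble_mean(chosen)))
    return sum(aucs) / len(aucs)

# Synthetic data: 7 hypothetical models emitting noisy probabilities.
rng = random.Random(42)
labels = [1] * 30 + [0] * 70
model_probs = [
    [min(1.0, max(0.0, 0.35 + 0.3 * y + rng.gauss(0, 0.25))) for y in labels]
    for _ in range(7)
]
for n in range(1, 8):
    print(n, round(subset_ensemble_auc(labels, model_probs, n), 3))
```

Averaging probabilities reduces the variance contributed by any single model, which is one plausible reason why larger ensembles tend to perform better when the individual models make partly independent errors.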

Construction of the clinical-radiologic model and combined models

In the development cohort, a logistic regression model was used to construct a clinical-radiologic (CR) model by incorporating clinical characteristics and radiologic signs. Based on the CR model, additional combined models were constructed by integrating the radiomics and DL signatures with the CR model. The aim of constructing the combined models was to investigate the ability of the signatures to improve the CR model and to identify the optimal prediction model.

Statistical analysis

Statistical analysis was performed using R software (version 4.1.0) and SPSS software (IBM, version 23.0). Continuous data were compared using the Student's t-test or Kruskal–Wallis test. Categorical data were analyzed using the Chi-square test or Fisher's exact test. The Spearman correlation coefficient was used to evaluate the correlation between variables. The diagnostic efficiency of the models was mainly quantified by the area under the receiver operating characteristic curve (AUC). The confidence interval (CI) of the AUC was calculated using 10,000 bootstrap replicates. The DeLong test was employed to compare the AUCs of different methods. Additional performance metrics, including accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, were also reported. The Net Reclassification Index and Integrated Discrimination Improvement were used to quantify the ability of the signatures to improve the CR model. For all statistical tests, P < 0.05 indicated a statistically significant difference.
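As an illustration of the bootstrap confidence interval for the AUC, the sketch below uses a percentile bootstrap with patient-level resampling. Both choices are assumptions, since the text does not specify the bootstrap variant or the resampling unit; the study itself used R, and the data here are synthetic. The demo uses 1000 replicates only to keep it fast.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to define an AUC
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic scores for 15 hypothetical cases and 25 hypothetical controls.
rng = random.Random(1)
labels = [1] * 15 + [0] * 25
scores = [0.6 * y + rng.random() * 0.8 for y in labels]
point = auc(labels, scores)
low, high = bootstrap_auc_ci(labels, scores, n_boot=1000)
print(f"AUC {point:.2f} [95% CI {low:.2f}, {high:.2f}]")
```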

Results

Baseline characteristics

A total of 674 patients were included in the study, divided into the development cohort (n = 337), internal validation cohort (n = 44), prospective validation cohort (n = 114) and external validation cohort (n = 179). The mean age of the study population was 60 years ± 11 (± standard deviation), and 51% (n = 343) of the patients were male. Adenocarcinoma (n = 592) accounted for 88% of the total population, while squamous cell carcinoma (n = 74) accounted for 11%. Among the patients, 21% (n = 143) had DM. The incidence rates of DM across different T-staging were as follows: T1 (9%), T2 (32%), T3 (37%), and T4 (60%). The baseline characteristics of the patients are presented in Table 1.

Table 1 The distribution of clinicopathological characteristics and radiologic signs across the cohorts

The development of radiomics signature and DL signature

For the radiomics signature, we extracted a total of 1688 features (Table S2). After applying the Gini importance ranking using a random forest regressor, we selected 20 features (Fig. S1) for further analysis. These 20 features were used to train the seven classifiers, as described in the Supplementary Information. In the case of the DL signature, we employed the NAS approach to select and train the top 10 candidate model architectures, as explained in the Supplementary Information. The development cohort's AUC for each single model used to create the radiomics signature and DL signature is presented in Fig. 3a–b. It is evident that these single models exhibited varying performance. However, as demonstrated in Fig. 3c–d, larger numbers of models used for ensemble prediction generally led to improved performance.

Fig. 3

The performance of single models and the evaluation of the proposed ensemble method using different numbers of single models. a, b Receiver operating characteristic (ROC) curves of each single model used to create the radiomics signature and DL signature in the development cohort. a radiomics signature and b deep learning signature. c, d The trend of different criteria when using different numbers of models to make an ensemble prediction. c radiomics signature and d deep learning signature

Evaluation of the predictive efficiency of the signature

In internal validation (Fig. 4a), the radiomics signature achieved an AUC of 0.89 [95% CI 0.87, 0.90], which was comparable to that of the DL signature. In prospective validation (Fig. 4b), the radiomics signature obtained an AUC of 0.85 [95% CI 0.80, 0.88], while the DL signature yielded an AUC of 0.79 [95% CI 0.76, 0.83]; however, the DeLong test indicated no statistically significant difference between the two (P = 0.184). In external validation (Fig. 4c), the DL signature achieved a significantly higher AUC (0.78 [95% CI 0.75, 0.80]) than the radiomics signature (0.73 [95% CI 0.72, 0.77]) (P = 0.043).

Fig. 4

Predictive performance of the radiomics and deep learning signatures across the validation cohorts. a–c Receiver operating characteristic (ROC) curves of the radiomics signature in the a internal validation cohort, b prospective validation cohort, and c external validation cohort. d–f The comparison among the radiomics signature, deep learning signature, and SUVmax on various performance indices in the d internal validation cohort, e prospective validation cohort, and f external validation cohort

To comprehensively assess the diagnostic efficiency, the performance of the signatures was compared to that of SUVmax, a well-established 18F-FDG PET/CT parameter known to be strongly linked to tumor metastasis [22, 23]. As depicted in Fig. 4e–f, both signatures exhibited similar discriminability to SUVmax in predicting DM across validation cohorts. Additionally, radiomics signature demonstrated superior specificity, whereas DL signature displayed better sensitivity across validation cohorts.
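The sensitivity and specificity compared above are threshold-dependent quantities derived from a confusion matrix. The sketch below shows how they, together with the other reported indices, could be computed from predicted probabilities. The cutoff of 0.5 is an assumption, since the text does not state the operating point, and the labels and probabilities are invented for illustration.

```python
def diagnostic_metrics(labels, probs, threshold=0.5):
    """Accuracy, sensitivity, specificity, PPV, and NPV at a probability cutoff."""
    pairs = list(zip(labels, probs))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# Invented example: 3 patients with DM and 5 without.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
metrics = diagnostic_metrics(labels, probs)
print(metrics)
```

Raising the cutoff trades sensitivity for specificity, so two signatures with similar AUCs can still differ in which of the two indices they favor at a fixed cutoff.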

Signature reasoning pattern interpretation

To elucidate the reasoning pattern of the radiomics signature, we focused on the "Original first order Energy" feature, which demonstrated the strongest association with DM based on both the Gini importance ranking (Fig. S1) and the Spearman correlation coefficient (Fig. S2). By employing radiomics voxel mapping, we visualized the contribution of individual voxels to the calculation of this radiomics feature on the CT scan. Interestingly, we observed discrepancies in voxel-level contribution between patients with and without DM, with a higher number of voxels exhibiting substantial contribution in tumors of patients with DM (Fig. 5a). Furthermore, statistical analysis confirmed that patients with DM exhibited significantly higher levels of "Original first order Energy" (Fig. 5c).
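For orientation, PyRadiomics defines first-order Energy as the sum of squared (optionally shifted) voxel intensities within the region of interest, so each voxel's contribution is simply its own squared intensity. The sketch below illustrates that definition on invented intensity values. The `shift` argument stands in for the PyRadiomics `voxelArrayShift` setting, and the per-voxel list is only a simplified stand-in for the voxel mapping described in the Supplementary Information, which may differ.

```python
def first_order_energy(intensities, shift=0.0):
    """Return (energy, per-voxel contributions) for a list of ROI intensities.

    Energy is the sum over voxels of (intensity + shift) ** 2, so each
    voxel contributes its own squared, shifted intensity.
    """
    contributions = [(x + shift) ** 2 for x in intensities]
    return sum(contributions), contributions

# Invented intensities for a small and a larger region of interest.
small_roi = [-50, 20, 35]
large_roi = [-50, 20, 35, 40, 10, -20]

energy_small, contrib_small = first_order_energy(small_roi)
energy_large, contrib_large = first_order_energy(large_roi)
print(energy_small, energy_large)
```

Because every voxel adds a non-negative term, Energy grows with the number of voxels in the region of interest, which means that it partly reflects tumor volume as well as intensity.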

Fig. 5

The comparison between cases and controls on intra-nodular radiomics feature and deep learning attention weights at CT (axial view). a The radiomics voxel mapping technique visualized the most relevant radiomics feature, "Original first order Energy", in tumors of various sizes to identify distal metastasis (DM). b The visualization of attention weights of the deep learning approach in tumors of various sizes facilitated the identification of DM. The color bar illustrates the strength of these features. c The comparison of the value of "Original first order Energy" between patients with and without DM. d The comparison of the average attention weights between patients with and without DM. ***P < 0.001

To interpret the reasoning pattern of the DL signature, we extracted the voxel-level attention weights from the first convolutional block attention module layer of the trained DL models. As depicted in Fig. 5b, the tumor voxels of patients with DM exhibited higher attention weights compared to those of patients without DM. Additionally, the average voxel-level attention weights were significantly higher in tumors of patients with DM (Fig. 5d).

These findings demonstrate that both radiomics and DL techniques have the capability to identify variations among voxels within CT.

Construction and evaluation of clinical-radiologic models and combined models

To develop the CR model, we first identified candidate covariates that were significantly associated with DM through univariate analysis. The Spearman correlation coefficients among the candidate covariates are shown in Fig. S3. Subsequently, the final model was constructed using multivariate logistic regression analysis (Table 2). Note that pathological metrics and 18F-FDG PET/CT parameters were excluded from model construction because the model is intended for preliminary screening. Multivariate logistic regression analysis revealed that pleural invasion (OR: 5.00; 95% CI 1.92, 13.36; P = 0.001) and cavitation (OR: 0.18; 95% CI 0.03, 0.65; P = 0.024) were the two radiologic signs that independently predicted DM in patients with NSCLC.
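For readers less familiar with logistic regression, the relation between a fitted coefficient and the reported odds ratio (OR) is summarized below. The Wald form of the confidence interval is an assumption shown only for orientation; the text does not state how the intervals were computed, and a profile-likelihood interval would generally not be symmetric on the log scale.

```latex
\[
\log\frac{P(\mathrm{DM}\mid x)}{1-P(\mathrm{DM}\mid x)}
  = \beta_0 + \sum_{j}\beta_j x_j ,
\qquad
\mathrm{OR}_j = e^{\hat{\beta}_j},
\qquad
95\%\ \mathrm{CI}_j \approx
  \exp\!\bigl(\hat{\beta}_j \pm 1.96\,\mathrm{SE}(\hat{\beta}_j)\bigr).
\]
```

An OR above 1, as reported for pleural invasion, corresponds to higher odds of DM when the sign is present, whereas an OR below 1, as reported for cavitation, corresponds to lower odds.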

Table 2 Univariate and multivariable logistic regression analysis to construct the clinical-radiologic model for predicting distal metastasis

The AUC of the CR model was 0.874 [95% CI 0.862, 0.892] in internal validation, 0.533 [95% CI 0.516, 0.537] in prospective validation, and 0.712 [95% CI 0.670, 0.721] in external validation (Table 3). In internal validation, there was no statistically significant difference in discriminative ability among the CR model, radiomics signature, and DL signature (P > 0.05). In prospective validation, the CR model performed worse than both the radiomics signature (0.533 vs. 0.854, P < 0.001) and the DL signature (0.533 vs. 0.786, P < 0.001). Moreover, in external validation, the CR model performed similarly to the radiomics signature (0.712 vs. 0.733, P = 0.398) but worse than the DL signature (0.712 vs. 0.780, P = 0.039).

Table 3 Performance of all models constructed in this study in the development cohort, internal validation cohort, prospective validation cohort, and external validation cohort

Based on the CR model, combined models were developed (Table S3). In prospective validation, the CR-radiomics model, CR-DL model, and CR-radiomics-DL model achieved AUCs of 0.876, 0.813, and 0.876, respectively. In external validation, these combined models achieved AUCs of 0.707, 0.721, and 0.705, respectively. Analysis of the Net Reclassification Index and Integrated Discrimination Improvement showed that the CR model was significantly improved by integrating the radiomics and DL signatures in the development cohort, internal validation cohort, and prospective validation cohort (Table S4). However, in external validation, only limited improvements were observed for the CR model, and all of the combined models performed worse than the DL signature (CR-radiomics model: 0.707 vs. 0.780, P = 0.026; CR-DL model: 0.721 vs. 0.780, P = 0.046; CR-radiomics-DL model: 0.705 vs. 0.780, P = 0.020).
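To make the two improvement indices concrete, the sketch below computes the Integrated Discrimination Improvement and a category-free (continuous) Net Reclassification Index for a new model relative to a baseline model. Whether the study used the continuous or a category-based NRI is not stated, so the continuous form is an assumption, and the probabilities are invented for illustration.

```python
def idi(labels, p_old, p_new):
    """Integrated Discrimination Improvement of p_new over p_old."""
    def mean(values):
        return sum(values) / len(values)
    gain_events = mean([n - o for y, o, n in zip(labels, p_old, p_new) if y == 1])
    gain_nonevents = mean([n - o for y, o, n in zip(labels, p_old, p_new) if y == 0])
    # Events should gain probability; non-events should lose it.
    return gain_events - gain_nonevents

def continuous_nri(labels, p_old, p_new):
    """Category-free NRI: net upward moves in events plus net downward moves in non-events."""
    events = [(o, n) for y, o, n in zip(labels, p_old, p_new) if y == 1]
    nonevents = [(o, n) for y, o, n in zip(labels, p_old, p_new) if y == 0]
    up_e = sum(n > o for o, n in events) / len(events)
    down_e = sum(n < o for o, n in events) / len(events)
    up_ne = sum(n > o for o, n in nonevents) / len(nonevents)
    down_ne = sum(n < o for o, n in nonevents) / len(nonevents)
    return (up_e - down_e) + (down_ne - up_ne)

# Invented probabilities from a baseline model and a combined model.
labels = [1, 1, 0, 0]
p_baseline = [0.6, 0.4, 0.5, 0.3]
p_combined = [0.8, 0.5, 0.4, 0.35]
print(idi(labels, p_baseline, p_combined))
print(continuous_nri(labels, p_baseline, p_combined))
```

Positive values of either index indicate that the combined model moves predicted probabilities in the right direction more often, or by a larger amount, than the baseline model does.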

Discussion

This study has several notable strengths. First, we developed two signatures that outperform previously reported methods in predicting DM in NSCLC. Second, we provided insights into the interpretation of the radiomics and DL signatures, shedding light on how they identify DM in NSCLC. Furthermore, these signatures were developed using LDCT, suggesting their potential for broad applicability as a preliminary screening tool for DM in NSCLC patients.

Preoperative staging plays a crucial role in determining the appropriate management strategy for patients with clinical stage I NSCLC [24]. Our study population exhibited high incidence rates of DM in patients with T1 (9%) or T2 (32%) tumors, as evidenced by their baseline characteristics. Consequently, additional evaluation, such as an 18F-FDG PET/CT scan, is essential in such cases [25]. Given that our signatures were developed and validated using a population comprising 83% (n = 557) of patients with T1 or T2 tumors, they could serve as a valuable tool for clinicians in conducting preliminary screening for DM using LDCT during lung cancer screening. This aids in triaging patients who require further examination.

The success of DL hinges on the efficacy of its carefully crafted neural architectures. These architectures are meticulously designed by experts with extensive professional experience, involving a time-consuming process [26]. However, a new automated method called neural architecture search (NAS) has emerged, showcasing its superiority over manually designed architectures in various tasks [27,28,29]. Previously, radiomics served as the vital connection between medical imaging and personalized medicine [30]. It is commonly used in conjunction with ML algorithms for a range of tasks, and it's important to note that there is no universally applicable ML model that fits every specific task [31]. In this study, we employed an ensemble strategy to develop signatures that effectively leveraged the strengths of multiple algorithms, thus complementing each other. This fusion resulted in an improved performance of the ensemble prediction.

Early radiomics studies showed limited predictive capabilities of primary tumor features for DM, with an AUC ranging from 0.64 to 0.71 [32, 33]. Even when DL methods were applied, they failed to significantly improve performance, achieving an AUC ranging from 0.65 to 0.71 [18]. These studies also had notable limitations, such as lacking external validation and producing unexplainable predictions. In our study, we introduced cutting-edge radiomics and DL signatures that outperformed previous research in internal validation, achieving AUCs ranging from 0.786 to 0.893. Furthermore, in external validation, our signatures demonstrated superior diagnostic efficiency with AUCs of 0.73 and 0.78.

Based on our analysis of reasoning patterns, we found that the radiomics and DL methods operate in a similar manner by identifying voxel-level differences within the tumor. These differences are then leveraged to assess the strength of correlation between individual voxels and DM. Notably, tumors of NSCLC patients with DM tended to exhibit a higher prevalence of voxels demonstrating a strong association with DM. This finding supports the efficacy of radiomics and DL methods in detecting DM and suggests that they capture information beyond what human observers can perceive. Additionally, our interpretation of the reasoning patterns suggests that a 2D approach may not be appropriate for this task. This insight may help explain why previous DL studies, which mainly used a 2D approach, yielded underwhelming results.

The performance of radiomics and DL signatures in this study showed variability across the validation cohorts, which could be attributed to inherent differences in the baseline characteristics among these cohorts. Furthermore, the performance might have been affected by variations in acquisition parameters across CT scanners from different vendors and institutions [34]. To improve the generalization of signatures when applied to new situations, transfer learning has been proposed as a potential solution [35]. Notably, both signatures demonstrated superior generalization ability compared to the CR model. When integrated with the CR model, these signatures significantly enhanced its discriminability. Ultimately, during external validation, the DL signature outperformed other models in terms of performance and sensitivity, suggesting its potential for optimizing the clinical workflow.

Our study has several limitations. First, the data were collected from only two institutions, which resulted in a relatively small number of NSCLC patients with DM (143 patients). To ensure the reproducibility and generalizability of the findings, further validation of the signatures in a larger, multi-institution study is necessary. Second, there is a possibility of overfitting during the model development phase. However, our signatures are integrated models composed of multiple single models, and this ensemble strategy helps to partially offset the impact of overfitting of the single models. Moreover, the performance of our models in the external validation cohort supports their generalization ability. Third, a comprehensive explanation of the underlying rationale behind these radiomics and DL signatures was not provided, and further investigation is required to enhance our understanding. For instance, exploring the involvement of specific genes or proteins may contribute to providing genomic biological interpretability for the signatures.

In conclusion, we developed and validated explainable LDCT-based radiomics and DL models for identifying DM in patients with NSCLC. Our models have demonstrated a high level of predictive efficiency, indicating their potential for effectively screening DM in NSCLC patients using routine LDCT scans in real-world clinical practice for lung cancer screening.