Introduction

Lung cancer is the most commonly diagnosed cancer and the leading cause of cancer-related death in the world [1], with non-small cell lung cancer (NSCLC) being the most commonly diagnosed histologic subtype [2]. Over the past 30 years, the 5-year overall survival (OS) of patients with metastatic disease has remained at 5% [3]. More recently, checkpoint blockade antibodies targeting PD-(L)1 have revolutionized cancer treatment and improved long-term survival among some patients with advanced NSCLC [4,5,6]. However, published evidence shows that PD-1/PD-L1 antibody monotherapies yield durable (> 6 months) clinical benefit (DCB) for only a subgroup of patients (15–19.4% in phase I/II clinical trials [4, 7]). As such, robust baseline biomarkers that are predictive of response to immune checkpoint blockade are needed to avoid immune-related toxicities in patients unlikely to achieve durable clinical benefit.

Currently, PD-L1 status is the only approved diagnostic biomarker for immunotherapy; patients with positive PD-L1 status generally have higher objective response rates (ORRs) [8, 9]. However, patients with PD-L1-negative tumors can still benefit from anti-PD-(L)1 therapies [6, 10]. A recent study showed that a combination of pembrolizumab and chemotherapy achieved ORRs of 22.9 to 61.4% regardless of PD-L1 status [11]. Additionally, intra-tumor heterogeneity of PD-L1 staining across biopsies is prevalent, leading to sampling bias [12]. There have been concerted efforts to use molecular biomarkers for prediction; along these lines, total mutational burden (TMB) exhibited a sensitivity of 86% and a specificity of 75% in predicting DCB in response to PD-1 checkpoint blockade [13]. However, TMB by whole exome sequencing is not clinically approved in the immunotherapy setting and is similarly subject to sampling bias. Hence, we sought to identify alternative predictive biomarkers that do not suffer from these weaknesses.

Radiological images are routinely available in the clinic. In particular, 18F-FDG PET/CT imaging is widely used for the staging of patients with advanced NSCLC. These medical images can be analyzed quantitatively with machine learning or “radiomics,” which is the process of converting medical images into high-dimensional mineable data. Prior studies have shown a strong correlation of radiomics signatures with genetics in NSCLC [14]. Recent advances in PET/CT radiomics have provided insights into precision medicine related to therapeutic response assessment [15, 16]. A radiomics approach has multiple advantages, including (1) being non-invasive and thus available for longitudinal sampling, (2) being based on standard-of-care images and thus widely available, and (3) sampling the entire tumor and thus not being subject to sampling bias [17].

The goal of this study was to determine whether PET/CT radiomics features at baseline, alone or in combination with clinical factors, can predict subsequent immunotherapy response leading to DCB, as well as progression-free survival (PFS) and OS, in patients with advanced NSCLC and, secondarily, whether the PET/CT radiomics signature could serve as a predictive biomarker in real time. The workflow is presented in Fig. 1.

Fig. 1

Radiomics workflow. The workflow includes image selection (only images with slice thickness ≤ 5 mm, no artifacts, and FDG uptake of the tumor on PET were included), registration and semi-automatic delineation, image preprocessing and feature extraction, feature selection, model training, and model validation

Patients and Methods

Patients

Inclusion criteria for this study were as follows: (1) patients had histologically confirmed advanced stage (stage IIIB and IV) NSCLC and were treated with anti-PD-(L)1 immunotherapy; (2) PET/CT images were acquired in the interval (less than 6 months) between the last treatment (or diagnosis) and the start of immunotherapy; (3) no other treatment was provided during the interval; and (4) follow-up time from initiation of immunotherapy treatment was greater than 6 months (Fig. 2). Initially, 146 patients who initiated therapy between June 2011 and December 2017 were enrolled retrospectively. These patients were randomly divided into training (N = 99) and test (N = 47) cohorts, with the condition that the two cohorts were not significantly different in terms of demographics or FDG PET avidity, as measured by the maximum standardized uptake value (SUVmax). Using the same inclusion criteria, a prospective cohort was subsequently accrued, consisting of 48 patients who initiated immunotherapy between January 2018 and June 2019. This was used as an additional independent (prospective) test cohort.

Fig. 2

Inclusion and exclusion diagram. The clinical data and the corresponding imaging data of the retrospective training cohort were used to train the radiomics signature, the DCB model, and the PFS and OS nomogram models, which were further validated using the retrospective test cohort and the prospective cohort enrolled according to the same inclusion and exclusion criteria

The baseline clinical characteristics (age, sex, histology, smoking status, chronic obstructive pulmonary disease (COPD) status, family history, Eastern Cooperative Oncology Group (ECOG) Scale of performance status, distant metastasis (M stage), brain metastasis, and prior treatments) and gene (EGFR, ALK, and ROS1) mutation status were obtained from the medical records. The main endpoints of this study were DCB, PFS, and OS. Clinically, immunotherapy response is frequently measured as DCB or NDB (no durable benefit) using a binary cutoff of PFS at 6 months [18, 19]. PFS and OS were defined as the time from the start date of immunotherapy to progression (or death); patients who were free of progression (or alive) or lost to follow-up were censored at the time of the last confirmed contact. Response Evaluation Criteria in Solid Tumors (RECIST 1.1) [20] was used to define progression.
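The binary endpoint described above can be illustrated with a short sketch. It assumes PFS is expressed in months and that benefit is "durable" when PFS exceeds 6 months; the function name, the argument names, and the handling of patients censored before the cutoff (returned as undetermined) are assumptions of this example rather than rules stated in the study.

```python
def label_response(pfs_months, progressed, cutoff_months=6.0):
    """Classify a patient as DCB or NDB from PFS using a binary cutoff.

    pfs_months: time from immunotherapy start to progression/death or censoring.
    progressed: True if the event was observed, False if the patient was censored.
    Returns "DCB", "NDB", or None when a censored patient has too little
    follow-up to decide (an assumption of this sketch).
    """
    if pfs_months > cutoff_months:
        # Progression-free beyond the cutoff, whether or not an event came later.
        return "DCB"
    if progressed:
        # Event observed at or before the cutoff.
        return "NDB"
    # Censored before the cutoff: the label cannot be determined.
    return None
```

Because the inclusion criteria require more than 6 months of follow-up, the undetermined case should be rare in a cohort selected this way.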

This study was approved by the Institutional Review Board at the University of South Florida (USF) and was conducted in accordance with ethical standards of the 1964 Helsinki declaration and its later amendments. The requirement for informed consent was waived, as PHI was not revealed.

PET/CT imaging

18F-FDG PET/CT imaging was performed as standard diagnostic work-up before treatment with immunotherapy. Details of the retrospective PET/CT images obtained from 9 different scanners and the prospective PET/CT images obtained from 2 different scanners are shown in Supplemental Methods S1. Heterogeneity in scanner parameters was deliberately chosen to ensure generalizability of the derived predictive models. All PET images were converted into SUV units by normalizing the activity concentration to the injected dose of 18F-FDG and the patient body weight after decay correction. Further, all the PET and CT images were resampled to 1 × 1 × 1 mm³ voxels using 3-dimensional Lagrange interpolating polynomials.
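The SUV conversion described above can be sketched as follows. This is a minimal illustration of the usual body-weight SUV, assuming a tissue density of 1 g/mL, a fluorine-18 half-life of about 109.77 min, and decay correction applied to the injected dose; the function and parameter names are hypothetical and are not taken from the study's pipeline.

```python
import math

F18_HALF_LIFE_MIN = 109.77  # physical half-life of fluorine-18, in minutes


def decay_corrected_dose_bq(injected_dose_bq, minutes_since_injection):
    """Activity remaining at scan time after physical decay of the tracer."""
    decay_constant = math.log(2) / F18_HALF_LIFE_MIN
    return injected_dose_bq * math.exp(-decay_constant * minutes_since_injection)


def suv_body_weight(activity_bq_per_ml, injected_dose_bq, body_weight_kg,
                    minutes_since_injection):
    """Body-weight SUV for one voxel (assumes 1 g of tissue per mL)."""
    dose_at_scan = decay_corrected_dose_bq(injected_dose_bq,
                                           minutes_since_injection)
    body_weight_g = body_weight_kg * 1000.0
    # SUV = tissue activity concentration / (decayed dose per gram of body weight)
    return activity_bq_per_ml / (dose_at_scan / body_weight_g)
```

For example, a voxel with 5000 Bq/mL in a 70 kg patient injected with 350 MBq has an SUV of 1.0 if no decay is assumed, and about 2.0 if one half-life has elapsed between injection and scan.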

Radiomics feature extraction

The primary lung tumors on PET and CT images were semi-automatically segmented with an improved level-set method based on gradient fields [19] and were further reviewed and corrected by a radiologist with 16 years of experience (JQ) who was blinded to the outcome label. After spatial registration using a rigid transformation that maximized the Dice similarity coefficient on the condition that the maximal axial cross-sections of the nodules were aligned, Kullback–Leibler divergence (KLD) images were generated from the fused PET and CT images on a voxel-wise basis using KLD criteria [21]. Seven hundred ninety features, including PET features, CT features, and KLD features, were then extracted from these segmented tumors and scaled into the range [0, 1] with unity-based normalization, as shown in Supplemental Methods S2-S4. After dimensionality reduction according to internal stability (Supplemental Methods S5), we investigated the radiomics expression patterns with unsupervised clustering to observe correlations among different features and different patients.
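Unity-based (min-max) normalization of a single feature across patients can be sketched as below. The function name and the convention of mapping a constant feature to 0.0 are assumptions of this example; the study's exact implementation is described in its supplement.

```python
def unity_normalize(values):
    """Rescale a list of feature values to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

In practice the minimum and maximum would be taken from the training cohort only and then reused for the test cohorts, so that no information leaks from test data into the scaling.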

Statistical analyses

The Wilcoxon signed-rank test and Fisher’s exact test were used to test for differences in continuous variables and categorical variables, respectively. For PFS and OS comparisons, a log-rank test was used. To determine the association of the radiomics expression patterns with clinical characteristics, a χ2 test was used. P values less than 0.05 were regarded as significant. Statistical analyses were performed with R 3.5.2 and MATLAB R2019a (Natick, MA).

Feature selection and radiomics signature building

We improved the least absolute shrinkage and selection operator (LASSO) method [22] by performing analyses separately on squamous cell carcinomas and adenocarcinomas (Supplemental Methods S6) to develop minimal feature sets for each that retained predictive information, which were then combined into a multiparametric radiomics signature (mpRS) weighted by their respective coefficients. To avoid overfitting, 5-fold cross-validation repeated 100 times was performed in the training cohort to generate a prioritized list of the most parsimonious sets of predictive features.
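One way to turn repeated cross-validation into a prioritized feature list is to count how often each feature is retained across all fits. The sketch below shows only that bookkeeping; it is not the study's procedure (see Supplemental Methods S6). The callback `select_features` is hypothetical and stands in for fitting a sparse model such as LASSO on the given training indices and returning the names of features with non-zero coefficients.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def selection_frequency(n_samples, select_features, repeats=100, k=5, seed=0):
    """Fraction of fits in which each feature was selected, most frequent first.

    `select_features(train_idx)` is a hypothetical callback returning the
    names of the features kept by a sparse model fitted on `train_idx`.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(repeats):
        folds = k_fold_indices(n_samples, k, rng)
        for held_out in range(k):
            # Train on every fold except the held-out one.
            train_idx = [i for f, fold in enumerate(folds) if f != held_out
                         for i in fold]
            counts.update(select_features(train_idx))
    total_fits = repeats * k
    return {name: n / total_fits for name, n in counts.most_common()}
```

Features that survive in most of the 500 fits (100 repeats of 5 folds) are the natural candidates for a parsimonious signature.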

Improved diagnostic validation of mpRS

To investigate the importance of the KLD features, digital phantom simulation experiments were performed. Two different digital phantoms were generated to have the same heterogeneity distribution (measured by entropy and inverse difference calculated from the 3D co-occurrence matrix) and used to detect the different heterogeneity distribution of the KLD fusion images.

Additionally, three different radiomics signatures generated from PET features (PETRS), CT features (CTRS), and the linear combination of PET and CT features (PETCTRS) using the above method were used for the comparison with mpRS according to their discrimination performances measured by the areas under receiver operating characteristics curves (AUC), classification accuracy (ACC), sensitivity (SEN), and specificity (SPEC). Furthermore, the total net reclassification improvement (NRI) was used to investigate the added value of the KLD features to PET and CT features.
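The discrimination metrics named above have standard definitions, sketched here in plain Python. The sketch assumes labels of 1 for DCB and 0 for NDB and that a higher signature score indicates DCB; the function names and the cutoff argument are illustrative.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(scores, labels, cutoff):
    """ACC, SEN, and SPEC when a score >= cutoff is read as a predicted DCB."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "ACC": (tp + tn) / len(labels),
        "SEN": tp / (tp + fn),   # true positive rate
        "SPEC": tn / (tn + fp),  # true negative rate
    }
```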

ANOVA analysis was performed to compare the distribution of the radiomics signatures among the different scanner types.

DCB prediction nomogram model building

Univariable logistic regression analysis was initially conducted to identify radiomics signatures, clinical factors, and common metrics (including the image-derived features of volume, SUVmax, and metabolic tumor volume [23]) that are associated with a DCB. Covariates that yielded a significant Wald’s statistic from multivariable logistic regression analysis were used for developing the DCB prediction nomogram model.

The goodness-of-fit of the models was evaluated with the Akaike information criterion (AIC) and the Hosmer–Lemeshow (HL) test [24]. The AUC, ACC, SEN, and SPEC were also used to evaluate the discrimination performances of different models. To compare the clinical usefulness of the different models, a decision curve analysis was performed by quantifying the added benefits at different threshold probabilities [25].
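Decision curve analysis plots net benefit against the threshold probability. The sketch below uses the standard net-benefit definition, true positives minus false positives weighted by the odds of the threshold, all divided by the number of patients. Function names are illustrative, and the threshold is assumed to lie strictly between 0 and 1.

```python
def net_benefit(probabilities, labels, threshold):
    """Net benefit of acting on patients whose predicted probability >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probabilities, labels)
             if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probabilities, labels)
             if p >= threshold and y == 0)
    # False positives are penalized by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Reference strategy: every patient is treated as a predicted responder."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both the treat-all reference and the treat-none reference, whose net benefit is zero.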

PFS and OS estimation nomogram model building

The potential of the radiomics signature to predict PFS and OS was assessed and optimized in the training cohort and then validated in the two independent test cohorts by using Kaplan–Meier survival analysis. The patients were classified into high-risk or low-risk groups according to the radiomics score cutoff that maximized Youden’s index based on the training cohort.
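The cutoff that maximizes Youden's index (J = sensitivity + specificity − 1) can be found by scanning the observed scores, as in this sketch. It assumes that a score at or above the cutoff is read as low risk (label 1) and that both classes are present in the training data; the function name is illustrative.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            # Keep the first (lowest) cutoff that attains the best J so far.
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j
```

The cutoff is derived once on the training cohort and then applied unchanged to the test cohorts to form the high-risk and low-risk groups.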

Univariable Cox regression analyses were conducted using the radiomics signature, clinical factors, SUVmax, MTV (metabolic tumor volume), and volume. Statistically significant hazard ratios (HRs) were included in PFS and OS nomogram models using multivariable Cox regression analysis and a backward stepwise selection with AIC as the stopping rule. The C-index and AIC were used to evaluate the prediction ability of the models, and the Z test was applied to determine whether the differences between different models were significant.
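The C-index used to evaluate the survival models can be illustrated with Harrell's pairwise definition. This is a minimal sketch, not the routine used in the study (the analyses were run in R and MATLAB). It assumes `events` is 1 for an observed event and 0 for censoring, and that a higher `risk_scores` value means higher risk; a signature for which higher values indicate longer survival would be negated first.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs where higher risk has the shorter time."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event strictly
            # before subject j's event or censoring time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.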

Pilot study for longitudinal assessment of the radiomics signature

In order to validate the dynamic characteristic and investigate whether the radiomics signature could identify the patients receiving benefit over the ensuing 6 months, patients with follow-up PET/CT scans after the initiation of the immunotherapy were used to investigate the longitudinal trend of the radiomics signature. The patients who had DCB since the start of the immunotherapy but progressed within 6 months since the follow-up scan time were regarded as follow-up NDB, and the patients who did not progress within 6 months since the follow-up scan time were regarded as follow-up DCB. The AUC was calculated to evaluate the follow-up DCB and NDB discrimination performances of the radiomics signature, and the distribution of the radiomics signature across the baseline time and follow-up time was used to reveal the trend of the radiomics signature.

Finally, to assess the quality of this radiomics study, the radiomics quality score (RQS) was calculated according to Lambin et al. [26].

Results

Clinical characteristics

The demographic and clinical characteristics of the three patient cohorts are presented in Table 1. Among the 146 retrospective patients, there were 88 men and 58 women and the overall mean age was 65.72 (± 12.88), and the median PFS and OS were 7.52 and 10.38 months, respectively. The retrospective training and test cohorts had identical distributions of SUVmax and statistically insignificant differences in their clinical characteristics, PFS (P = 0.62) and OS (P = 0.42) (Supplemental Fig. S1). The 48 prospective patients curated from standard-of-care regimens had median PFS and OS of 6.78 and 9.95 months, respectively. Although slightly shorter compared with the retrospective cohorts, the PFS and OS were not significantly different in the prospective cohort. There were also no significant differences in histology, smoking history, sex, age, and mutation status. Twenty-two patients had follow-up PET/CT scans between 1 and 53 months of therapy, and these were used to investigate the longitudinal trend of the radiomics signature in a pilot study. Five of these patients had NDB since the start of the immunotherapy. For the remaining 17 patients who had DCB since the start of the immunotherapy, 6 of them progressed within 6 months since the follow-up scan time, while 11 patients did not progress and continued to experience clinical benefit within 6 months since the follow-up scan time.

Table 1 Demographic and clinical characteristics of patients

Feature extraction

Of the original 790 extracted features, 324 remained after filtering for internal stability. Unsupervised clustering revealed 3 clusters of patients with similar radiomics expression patterns (Fig. 3a), which were significantly associated with histology (P = 0.008, χ2 test) and response (P = 0.028, χ2 test). The prospective patients further showed similar radiomics expression patterns and validated this association of these radiomics feature patterns with histology (P = 0.041, χ2 test) and response (P = 0.085, χ2 test) (Fig. 3b).

Fig. 3

Expression patterns of radiomics features. (a) Unsupervised clustering of all the retrospective and prospective patients (rows) and radiomics feature expression (columns) revealed clusters of patients with similar radiomics expression patterns. (b) Clinical patient parameters showing the association of the radiomics expression patterns with histology (retrospective: P = 0.008, prospective: P = 0.041, χ2 test) and response (retrospective: P = 0.029, prospective: P = 0.085, χ2 test)

Feature selection and radiomics signature building

Pearson grouping was used to eliminate redundant features, resulting in 21 uncorrelated features (10 PET features, 4 CT features, and 7 KLD features). These were then used as inputs into the LASSO method. Through 5-fold cross-validation on the training cohort (Supplemental Fig. S2), 8 features emerged as the best features to construct the radiomics signature, and these are shown in Supplemental data S7. Representative radiomics signatures of two patients from the baseline PET/CT scan and the follow-up PET/CT scan are shown in Fig. 4.
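Redundancy filtering by Pearson correlation can be sketched as a greedy pass over the features. The threshold of 0.8, the rule of keeping the first feature of each correlated group, and the function names are all assumptions of this example; the study does not state its grouping threshold in this section. Features are assumed to be non-constant.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_correlated(features, threshold=0.8):
    """Keep a feature only if |r| with every already-kept feature is below threshold.

    `features` maps a feature name to its values across patients; iteration
    order defines priority (earlier features win within a correlated group).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Ordering the input by a stability or relevance ranking would make the retained representative of each group the most reliable one.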

Fig. 4

Radiomics signatures of NSCLC patients. (a, b) The CT, PET, and fusion images for a patient with ADC NSCLC obtained 1 month before and 6 months after immunotherapy, which means the patient would have DCB since the start of immunotherapy and 6 months post-immunotherapy. (c, d) The CT, PET, and fusion images for a patient with ADC NSCLC obtained 1 month before and 9 months after immunotherapy, which means the patient would have DCB since the start of immunotherapy, but would have NDB after 9-month immunotherapy. The corresponding clinical feature and radiomics scores are shown in the bottom of each image

Validation of the radiomics signature

Improved diagnostic validation of mpRS

From the simulated PET images (the first column) and simulated CT images (the second column) shown in Fig. S3, the two phantoms are classified as having the same heterogeneity and homogeneity distribution. However, from the fusion images (the third column), the two phantoms were classified as having different heterogeneity and homogeneity, which means the KLD features could identify the relative different positional relationship of the heterogeneity.

There were significant differences in the four radiomics signature scores between DCB and NDB patients in the training cohort (PETRS: P < 0.001; CTRS: P = 0.020; PETCTRS: P < 0.001; mpRS: P < 0.001). Except for the CTRS, the other three radiomics signatures had significant differences between DCB and NDB patients in the retrospective (PETRS: P = 0.006; PETCTRS: P = 0.003; mpRS: P < 0.001) and prospective (PETRS: P = 0.019; PETCTRS: P = 0.009; mpRS: P < 0.001) test cohorts. The mpRS achieved the highest AUCs of 0.86 (95%CI 0.79–0.93), 0.83 (95%CI 0.71–0.94), and 0.81 (95%CI 0.68–0.92) in the training, retrospective test, and prospective test cohorts, respectively (Supplemental Table S1 and Fig. S4). Compared with PET+CT features, the inclusion of the KLD features yielded a total net reclassification improvement (NRI) of 0.50 (95%CI 0.11–0.88, P = 0.011), 0.55 (95%CI − 0.012–1.11, P = 0.055), and 0.94 (95%CI 0.44–1.45, P < 0.001) in the training, test, and prospective test cohorts, respectively, which showed significantly improved classification accuracy for response prediction. Therefore, only mpRS was used for the subsequent analyses.

Box plots and ANOVA analyses of each of the radiomics signatures are shown in Supplemental Fig. S5, which illustrates that these signatures are stable across 9 different equipment manufacturers (P > 0.05), with mpRS being the most stable signature with the fewest outliers.

DCB prediction nomogram

Univariable logistic regression analysis on the training cohort identified mpRS, distant metastasis, and ECOG status as strong predictors of response, and these were also validated in the test cohorts (P < 0.05, Supplemental Table S2). Through multivariable logistic regression analysis (Supplemental Table S3), ECOG scale and distant metastasis were predictive of a DCB response when combined with the mpRS (Fig. 5a), and adding these clinical variables generated higher AUCs of 0.89, 0.86, and 0.86 in the training, retrospective test, and prospective test cohorts, respectively (Supplemental Table S1 and Fig. S4). This model was well calibrated in all three cohorts (Fig. 5c). The inclusion of ECOG scale and distant metastasis yielded a significant total NRI of 0.79 (95%CI 0.47–1.01, P < 0.001), 1.05 (95%CI 0.55–1.54, P < 0.001), and 1.20 (95%CI 0.75–1.65, P < 0.001) in the training, retrospective test, and prospective test cohorts, respectively.

Fig. 5

Nomograms and the corresponding calibration curves. (a) and (c) are the nomograms constructed with mpRS and clinical characteristics to estimate the probability of DCB, along with the assessment of the model calibration in the training cohort, retrospective test cohort, and prospective cohort, respectively (e.g., for a patient with MRS of 0.6, ECOG 1 at stage M0, the total is 95 points (MRS 0.6 corresponding to 54 points, ECOG 1 corresponding to 28 points, no distant metastasis corresponding to 13 points, 54 + 28 + 13 = 95), which corresponds to a DCB probability of 0.79). (b) shows the decision curves for different radiomics signatures and DCB response prediction models on all the patients. (d) and (f) are the nomograms constructed with mpRS and clinical characteristics to estimate the risk of progression, along with the assessment of the model calibration in the training cohort, retrospective test cohort, and prospective cohort, respectively (e.g., for an ADC patient with MRS of 0.6 and ECOG 1, the total is 85 points (MRS 0.6 corresponding to 46 points, ECOG 1 corresponding to 39 points, ADC corresponding to 0 points, 46 + 39 + 0 = 85), which corresponds to a 6-month PFS probability of 0.71, 1-year PFS probability of 0.5, and 2-year PFS probability of 0.42). (e) is the scatter plot of the nomogram-predicted 6-month PFS probability of individual patients versus the PFS time of the corresponding patients. (g) and (i) are the nomograms constructed with mpRS and clinical characteristics to estimate the risk of death, along with the assessment of the model calibration in the training cohort, retrospective test cohort, and prospective cohort, respectively (e.g., for an ADC patient with MRS of 0.6 and ECOG 1, the total is 89 points (MRS 0.6 corresponding to 39 points, ECOG 1 corresponding to 50 points, ADC corresponding to 0 points, 39 + 50 + 0 = 89), which corresponds to a 6-month OS probability of 0.94, 1-year OS probability of 0.81, and 2-year OS probability of 0.59). (h) is the scatter plot of the nomogram-predicted 1-year OS probability of individual patients versus the OS time of the corresponding patients. For (e) and (h), the red points represent the training cohort, the green points represent the test cohort, and the blue points represent the prospective cohort. The vertical dotted line indicates the actual PFS or OS probability obtained with the Kaplan–Meier estimate at the given time point, and the horizontal dotted line indicates the 6-month or 1-year time point. Each pair of vertical and horizontal dotted lines divides the graph into 4 quadrants; the points in the lower left and top right quadrants correspond to accurate predictions

The decision curves shown in Fig. 5b indicate that the combined (clinical+radiomics) DCB nomogram model had the highest overall net benefit across the majority of the range of reasonable threshold probabilities in all the patients compared with radiomics or clinical signatures alone.

PFS and OS prediction nomogram

Through Kaplan–Meier survival analysis (cutoff = 0.41), mpRS was able to significantly predict PFS and OS in the training (P < 0.001, P < 0.001), retrospective test (P = 0.001, P = 0.002), and prospective test cohorts (P < 0.001, P = 0.002), respectively. Patients with higher radiomics scores had longer PFS and OS (Supplemental Fig. S6). According to univariable Cox regression analysis (Supplemental Tables S4 and S6), mpRS, histology, and ECOG scale were significantly associated with PFS and OS in the training and test cohorts. To investigate the prognostic value of the mpRS within the different subgroups of histology, stratified Kaplan–Meier survival analyses were also performed. As shown in Fig. 6, even though patients with adenocarcinoma (ADC) had better PFS than those with squamous cell carcinoma (SCC), the mpRS was predictive of PFS and OS for both histologies. Multivariable Cox proportional hazards models (Supplemental Tables S5 and S7) including mpRS, histology, and ECOG scale were then built for PFS and OS estimation and are expressed as the nomograms shown in Fig. 5 d and g. The calibration curves of the mpRS nomograms on the training, retrospective test, and prospective test cohorts are shown in Fig. 5 f and i for PFS and OS, respectively, and show excellent agreement between the nomogram predictions and actual observations. Additionally, the scatter plots of nomogram-predicted 6-month PFS and 1-year OS probability for individual patients versus the corresponding PFS and OS time are provided in Fig. 5 e and h. Spearman’s rank correlation coefficients for individualized predictions were > 0.6 with P < 0.001, suggesting a strong positive correlation.

Fig. 6

Stratified Kaplan–Meier survival curves. (a–c) are stratified Kaplan–Meier survival curves of PFS according to mpRS on training, retrospective test and prospective test cohorts within the different subgroups of histology. (d–f) are stratified Kaplan-Meier survival curves of OS according to mpRS on training, retrospective test, and prospective test cohorts within the different subgroups of histology

The quantitative results of different models are shown in Table 2, which shows the combined (clinical+radiomic) nomograms yielded significantly strong prediction results (P < 0.05) with C-indexes of 0.74 (95%CI 0.68–0.80), 0.74 (95%CI 0.66–0.82), and 0.77 (95%CI 0.69–0.84) for PFS estimation and C-indexes of 0.83 (95%CI 0.77–0.88), 0.83 (95%CI 0.71–0.94), and 0.80 (95%CI 0.69–0.91) for OS estimation in the training, test, and prospective test cohorts, respectively.

Table 2 Performance of different models in PFS and OS estimation

Pilot study for longitudinal assessment of the radiomics signature

A further analysis using subsequent follow-up scans, when available, showed the mpRS generated from the follow-up PET/CT images during treatment could also predict follow-up DCB with an AUC of 0.82 (95%CI 0.63–1.00). Further, it had a decreasing trend with time (Supplemental Fig. S7), suggesting that the risk of progression increased with time.

Radiomics quality score

Radiomics is a rapidly maturing field, and the quality of study designs and their results can be challenging to evaluate. To address this, Lambin et al. developed a 36-point “radiomics quality score” (RQS) metric [26]. The criteria are described in Supplemental Table S8, which shows that the current study had an RQS of 24. To put this in perspective, a recent meta-analysis [27] analyzed 77 radiomics publications and documented that the mean RQS across all studies was 9.4 ± 5.6, indicating that the current study is in the upper 5th percentile of radiomics study designs.

A TRIPOD Checklist following reporting guidelines for prediction model development and validation (https://www.equator-network.org/reporting-guidelines/tripod-statement/) has also been provided in Supplemental Table S9, which further validated the integrity of the work.

Discussion

In this work, we developed a radiomics signature to predict DCB in immunotherapy-treated NSCLC patients, which was successfully validated in independent retrospective and prospective test cohorts. In addition, combining this signature with ECOG status and histology into a nomogram further facilitated the individualized prediction of PFS and OS prior to initiation of checkpoint blockade immunotherapy.

PET/CT radiomics studies typically extract image-derived quantitative features independently from PET or corresponding CT images [28,29,30,31,32], and mutual information between PET and CT images has only been assessed by qualitative metrics rather than quantitative descriptors [33,34,35,36]. By contrast, in this work, we constructed KLD features as quantitative descriptors of mutual information between co-registered PET and CT, and these significantly improved the prediction results. To illustrate the importance of the KLD features, digital phantom simulation experiments were performed. Supplemental Fig. S3 shows that different phantoms (a and b) could not be distinguished by PET or CT heterogeneity features. However, the KLD fusion images, which reflect the relative positional relationship of the heterogeneity, could distinguish these phantoms.

When investigating the informative components of the mpRS formula, we found that multiple texture features (PET_SRLGE, KLD_SZE) were positively correlated, suggesting that more heterogeneous tumors were more likely to have a DCB. This was somewhat surprising, as prior studies have shown that tumors with more heterogeneous CT textures had worse response to radiation or chemotherapy [14]. A recent study showing that patients who had faster-growing tumors before initiation of immunotherapy had better responses [37] may suggest that the more aggressive tumors that respond worse to conventional therapies are more susceptible to immune modulation. In terms of shape, tumors with more convexity (smaller CHDensity) had a higher probability of a DCB. This could be explained by the finding of Saeed-Vafa et al. [38] that PD-L1 cells are able to form a more rounded mass due to their ability to evade immune attack and coexist with the vasculature, and high PD-L1 expression is associated with significantly longer PFS and longer OS [8, 9]. In terms of metabolic activity, tumors with a lower mean SUV and HU (KLDiv_mean) had a higher probability of having a DCB. This is expected, as more metabolically active tumors will produce more lactic acid [39] and lactic acid is a potent inhibitor of effector T cell function [40].

Notably, only 4 CT features remained after Pearson grouping and internal stability comparison, and the CT radiomics score constructed with these 4 features did not perform as well as mpRS, with AUCs of 0.69 and 0.64 in the test and prospective test cohorts, respectively. These AUCs are smaller than those observed in a recent Annals of Oncology paper, which generated an AUC = 0.79 in NSCLC primary tumors with contrast-enhanced computed tomography (CE-CT) scans [41]. This was also surprising, given our prior radiomics analyses of CT images to predict immunotherapy response [3]. A possible explanation for the low predictive power of CT in the current study is that the CT images in PET/CT have lower resolution and are not contrast enhanced, compared with diagnostic CTs, which may lead to lower discrimination ability. Secondly, given that no inter-scanner correction was used for CT features, only a few CT features remained after filtering for internal stability. Additionally, other possible reasons that the 4 CT features were not selected in the mpRS formula are that the KLD features already included the morphological information and that some studies have shown that metabolic modifications on PET are more predictive than morphological modifications on CT, especially in early response prediction [11, 42].

Moreover, as Moffitt Cancer Center is a referral hospital, many patients’ PET/CT images were acquired from different institutions, leading to large variability in acquisition and reconstruction methods. As we first filtered for internal stability, we selected for features that would be stable across different scanners. We contend that this is a strength of the current approach, as it allows for a more generalizable and transportable model (Supplemental Fig. S5).

SUVmax and MTV were shown to be effective in predicting efficacy and survival at 1 month after immunotherapy in some studies [42]. However, according to Supplemental Table S2, these factors were not significant in predicting DCB in our current study. On the one hand, this may be because the time points were different (1 month versus 6 months), and SUVmax and MTV may play different roles at different time points. On the other hand, this may be due to the different cohorts. Those prior studies were conducted at single institutions with rigorous acquisition protocols, whereas our study utilized data from multiple scanners and multiple institutions, acquired under standard of care. We have previously shown that the reproducibility of SUVmax and SUVmean under standard-of-care multi-institutional PET+CT imaging is lower than that which can be achieved at a single institution under a research protocol [43].

The present study also has some limitations. First, PD-L1 status of most patients was unavailable in this dataset and hence could not be compared with the mpRS. Second, given that PET/CT is not commonly used in clinical follow-up, only 22 patients had follow-up PET/CT scans that could be further analyzed. Their radiomics scores had a decreasing trend with time (Supplemental Fig. S7), suggesting that the risk of progression increases with time, which may be caused by the emergence of resistance. The significant prediction results (AUC = 0.82) suggest that the radiomics signature might be used as a predictive biomarker in real time to guide subsequent immunotherapy, pending further prospective validation with larger cohorts. Third, in order to capture enough data for training, the interval time of 6 months in the inclusion criteria was fairly broad. However, the median (average) interval time of these data was 1.73 (2.13) months for the retrospective cohort, and most of the PET/CT images (76%) were acquired within 3 months. Through ANOVA analysis (Fig. S8), there was no significant difference in mpRS among different subgroups divided by interval time for DCB and NDB patients in both training and test cohorts. Additionally, the AUCs decreased when the interval time was larger than 3 months and 2 months in the training and test cohorts, respectively, but the differences were not significant. One possible reason is the small number of patients with an interval time larger than 3 months. Another possible reason is that the mpRS was not significantly correlated with the volume of the tumor (Spearman’s rho = − 0.20, P = 0.06 for the training cohort; Spearman’s rho = − 0.17, P = 0.27 for the test cohort), which means that increased volume within the interval time may not lead to significant variations of mpRS in predicting DCB. Given that the subgroup of patients with a 0–1-month interval time had the largest AUC in both training and test cohorts, the interval time of all the patients in the prospective cohort was less than 1 month. Lastly, there were significant differences in distant metastasis between the retrospective and prospective cohorts, but this did not affect the final prediction of DCB due to the incorporation of distant metastasis into the DCB nomogram model. Given that distant metastasis is not a significant factor for PFS and OS prediction, this difference also would not affect the PFS and OS prediction.

Conclusions

In conclusion, an effective and stable radiomics signature combining PET and KLD features was identified and may serve as a predictive biomarker for immunotherapy response. Furthermore, the radiomics nomograms demonstrated the incremental value of the radiomics signature for individualized estimation of DCB, PFS, and OS and have the potential to guide individualized management before and after the start of immunotherapy, pending further external validation with larger cohorts.