Introduction

Extranodal natural killer/T cell lymphoma (ENKTL) is an aggressive malignancy of putative NK cell origin, with a minority deriving from the T cell lineage [1, 2]. ENKTL is much more common in Asia and Latin America, comprising approximately 3 to 10% of all lymphomas in East Asia, but less than 1% in Western countries. An optimal treatment strategy for ENKTL has yet to be established, and accurate assessments are crucial for prognosis prediction and individualised treatment decisions [3, 4].

18F-Fluorodeoxyglucose (18F-FDG) positron emission tomography-computed tomography (PET/CT) is routinely used for lesion detection, response monitoring, and prognostic assessment in ENKTL patients. The maximum standardised uptake value (SUVmax) is the most widely used PET imaging parameter for prediction; moreover, metabolic tumour volume (MTV) and total lesion glycolysis (TLG), which are based on both SUV and tumour volume, have also been reported to be significant prognostic biomarkers for ENKTL [5, 6]. However, these metabolic parameters do not fully reflect the spatial distribution of a tracer, which has been suggested to correlate with intra- and intertumoural heterogeneity, and even with worse prognosis and survival [7]. Over the past decade, radiomics has emerged as a promising field that extracts and analyses a large number of advanced quantitative imaging features from medical images with high throughput to provide abundant information on tumour heterogeneity [8, 9]. The most widely used medical imaging modalities in radiomic research are computed tomography (CT) and magnetic resonance imaging (MRI). Currently, radiomic features extracted from PET are considered to contain large amounts of underlying information distinct from that provided by CT and MRI [10, 11]. A continuously increasing number of studies have reported promising results regarding the value of PET-based radiomic features for various types of solid tumours [12,13,14,15,16]. Nevertheless, few published studies have reported the diagnostic, staging, or prognostic value of PET-based radiomic features in lymphoma [17,18,19,20,21]. In addition, combining multiple imaging biomarkers as a predictive signature using radiomic methods, rather than individual analyses, is a promising and useful approach for prognosis prediction and clinical management [22].

Thus, in this retrospective study, we aimed to develop radiomic signatures (R-signatures) using 18F-FDG PET radiomic features for the prediction of progression-free survival (PFS) and overall survival (OS) in patients with ENKTL. Subsequently, a radiomics-based model integrating the R-signatures and clinical factors was established, validated, and compared with a metabolism-based model integrating metabolic parameters and clinical factors. The models were visualised as nomograms.

Materials and methods

Patients and follow-up

Ethical approval was obtained, and the requirement for informed consent from patients was waived. The inclusion criteria were as follows: (a) pathologically diagnosed ENKTL between January 2011 and January 2017, (b) pretreatment 18F-FDG PET/CT, and (c) stage I to II patients who received 2 cycles of chemotherapy with LVP (l-asparaginase, vincristine, prednisone) and concurrent radiotherapy with 2 cycles of cisplatin chemotherapy, followed by 2 cycles of LVP; stage III to IV patients who received chemotherapy with LVP with or without radiotherapy if a response was observed, and otherwise received second-line therapy. In total, 110 consecutive patients were enrolled and randomly allocated to two cohorts (82 and 28 patients in the training and validation cohorts, respectively). Follow-up was performed every 3 months after the completion of treatment. The last follow-up was conducted in April 2018. The end-points of this study were PFS and OS. PFS was defined as the interval between the date of diagnosis and the date of the first relapse, progression, or death. OS was defined as the interval between the date of diagnosis and the date of death.

PET scanner and acquisition parameters

18F-FDG PET/CT examinations were performed on a Gemini GXL PET/CT scanner equipped with a 16-slice CT (Philips Medical System). After at least 6 h of fasting, 190–375 MBq of 18F-FDG was administered intravenously. The blood glucose level was controlled to be lower than 8.0 mmol/L. Whole-body PET/CT scans were started 60 min after radiopharmaceutical injection. Emission data were acquired for 2 min per bed position. PET emission acquisition was performed in 3D mode (3D-RAMLA): the dimensions of the in-plane matrix were 4 mm × 4 mm, and the slice thickness was 4 mm. All examinations were reconstructed using an OSEM algorithm, and the CT acquisition data were used for attenuation correction.

Segmentation and feature extraction

Two nuclear medicine physicians independently performed segmentation using the Local Image Features Extraction (LifeX) package (version 4.00, http://www.lifexsoft.org) [23]. First, the lymphoma lesions were delineated manually. Then, the regions of interest (ROIs) were defined based on a threshold of 40% of the SUVmax of the defined lesions [24], and spatial resampling (2 × 2 × 2 mm), absolute intensity resampling (0–20), and intensity discretisation (number of grey levels = 64, size of bins = 0.3125) were performed [25]. A total of 41 features (Supplementary Materials) of the ROIs were extracted as follows: first-order metrics extracted from the histogram and shape; features derived from the grey-level co-occurrence matrix (GLCM), the neighbourhood grey-level difference matrix (NGLDM), the grey-level run length matrix (GLRLM), and the grey-level zone length matrix (GLZLM); and conventional metabolic parameters, including the SUVmax, MTV, and TLG.
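The thresholding and discretisation settings above can be illustrated with a minimal sketch. It is not the LifeX implementation: the function names, the flat list of voxel values, and the convention of numbering grey levels from 1 are assumptions made for illustration, and the voxel SUVs are invented. It shows that an absolute range of 0 to 20 split into 64 grey levels gives the stated bin size of 20/64 = 0.3125.

```python
import math


def threshold_roi(suv_values, fraction=0.40):
    """Keep the voxels whose SUV is at least `fraction` of the lesion SUVmax."""
    cutoff = fraction * max(suv_values)
    return [v for v in suv_values if v >= cutoff]


def discretise(suv_values, lower=0.0, upper=20.0, n_levels=64):
    """Map SUVs onto grey levels 1..n_levels over a fixed absolute range."""
    bin_width = (upper - lower) / n_levels  # (20 - 0) / 64 = 0.3125
    levels = []
    for v in suv_values:
        clipped = min(max(v, lower), upper)  # values outside the range are clipped
        level = math.floor((clipped - lower) / bin_width) + 1
        levels.append(min(level, n_levels))  # an SUV equal to `upper` goes in the top bin
    return bin_width, levels


# Invented voxel SUVs for one hand-delineated lesion (illustrative only).
lesion = [1.2, 3.9, 4.1, 6.0, 10.0]
roi = threshold_roi(lesion)               # cutoff is 0.40 * 10.0 = 4.0
bin_width, grey_levels = discretise(roi)
print(roi, bin_width, grey_levels)
```

A fixed absolute range keeps a given grey level tied to the same SUV interval in every patient, which is why the bin size follows directly from the range and the number of levels.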

Radiomic feature selection and model building

First, the interobserver repeatability of the segmentation was evaluated using the intraclass correlation coefficient (ICC) method, and the features with an ICC greater than 0.70 were selected [26]. Then, the least absolute shrinkage and selection operator (LASSO) Cox regression algorithm was applied to the selected features [27]. Cross-validation was applied to optimise the value of λ, the coefficients of indistinctive covariates were reduced to zero, and the remaining nonzero coefficients were selected. The nonzero coefficients of the selected features were defined as radiomic scores (R-scores). We calculated the combination of R-scores for all selected radiomic features, defined as the R-signature. We determined the optimal threshold value of the R-signature by the receiver operating characteristic (ROC) curve and divided patients into high- and low-risk groups. The potential association of the R-signature with PFS and OS was evaluated using the Kaplan-Meier analysis and log-rank test.
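As a rough illustration of the last two steps, the sketch below computes a signature as the coefficient-weighted sum of the selected feature values and then picks a cut-off from the ROC curve. The feature names, coefficients, and patient data are hypothetical, and the use of the Youden index (sensitivity + specificity − 1) as the "optimal" ROC criterion is an assumption, since the text does not name the criterion used.

```python
def r_signature(features, coefficients):
    """Weighted sum of the selected features, using their nonzero coefficients."""
    return sum(coefficients[name] * features[name] for name in coefficients)


def youden_cutoff(scores, events):
    """Threshold maximising sensitivity + specificity - 1.
    Patients with a score >= threshold are labelled high risk."""
    n_pos = sum(events)
    n_neg = len(events) - n_pos
    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, events) if e == 1 and s >= t)
        fp = sum(1 for s, e in zip(scores, events) if e == 0 and s >= t)
        j = tp / n_pos - fp / n_neg  # sensitivity - (1 - specificity)
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Hypothetical coefficients and feature values (not the study's formulas).
coefficients = {"feat_a": 0.5, "feat_b": -0.25}
patients = [
    {"feat_a": 2.0, "feat_b": 4.0},
    {"feat_a": 4.0, "feat_b": 2.0},
    {"feat_a": 1.0, "feat_b": 0.0},
    {"feat_a": 6.0, "feat_b": 4.0},
]
events = [0, 1, 0, 1]  # 1 = progression or death observed

scores = [r_signature(p, coefficients) for p in patients]
cutoff, youden_j = youden_cutoff(scores, events)
groups = ["high" if s >= cutoff else "low" for s in scores]
print(scores, cutoff, groups)
```

The sketch treats the outcome as a binary event status for the ROC step, which ignores censoring; a time-dependent ROC analysis would be an alternative design.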

We used the univariate Cox regression to select the significant prognostic factors of PFS and OS, respectively. R-signature and clinical variables were entered into a multivariate Cox regression to build the radiomics-based model for PFS and OS prediction. Likewise, metabolic parameters and clinical variables were entered into a multivariate Cox regression to build the metabolism-based model. Models were then visualised as nomograms. The flowchart of model building is presented in Fig. 1.

Fig. 1 Flowchart showing the development of the models

Model validation

The discrimination of the models was assessed using the Harrell concordance index (C-index) [28, 29]. Bootstrap analyses with 1000 resamples were used to obtain a corrected C-index. The calibration of the models was assessed by Hosmer-Lemeshow tests and calibration curves, with p > 0.05 indicating no significant deviation from the theoretical perfect calibration [30].
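For readers unfamiliar with the C-index, the following sketch computes Harrell's C from follow-up times, event indicators, and predicted risks, and then recomputes it on bootstrap resamples. It is a simplified illustration under stated assumptions: pairs with tied times are ignored, the data are invented, and the resampling gives only the spread of the C-index. A full optimism correction would additionally refit the model in every resample, which is not shown here.

```python
import random


def harrell_c(times, events, risks):
    """Harrell's C-index; returns None when no pair of patients is comparable."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # only an observed event can anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # higher risk failed first
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risks count as half
    return concordant / comparable if comparable else None


def bootstrap_c(times, events, risks, n_resamples=1000, seed=0):
    """C-index of each bootstrap resample (patients drawn with replacement)."""
    rng = random.Random(seed)
    n = len(times)
    values = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        c = harrell_c([times[k] for k in idx],
                      [events[k] for k in idx],
                      [risks[k] for k in idx])
        if c is not None:  # skip resamples without comparable pairs
            values.append(c)
    return values


# Invented follow-up data: months, event indicator, predicted risk.
times = [5, 10, 15, 20, 25]
events = [1, 1, 0, 1, 0]
risks = [0.9, 0.4, 0.6, 0.5, 0.1]

c_index = harrell_c(times, events, risks)  # 6 concordant of 8 comparable pairs
resampled = sorted(bootstrap_c(times, events, risks))
lower = resampled[int(0.025 * len(resampled))]
upper = resampled[int(0.975 * len(resampled))]
print(c_index, lower, upper)
```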

Statistical analysis

Statistical analyses were performed using IBM SPSS Statistics (version 19.0, IBM Corp) and R software (version 3.4.2; http://www.R-project.org). All tests were two-sided, and p values of < 0.05 were considered significant.

Results

The clinical characteristics of the patients are summarised in Table 1. No differences were found between the training and validation cohorts (p = 0.149–0.945). The median follow-up time was 33 months (2–90 months). As of the final follow-up, the numbers of survivals and deaths were 63 (57.3%) and 25 (22.7%), respectively.

Table 1 Baseline characteristics of the patients in the training and the validation cohorts

R-signature construction and assessment

Six radiomic features had low ICCs (ICC < 0.70), and twenty-seven radiomic features had good repeatability (ICC ≥ 0.85). The thirty-two features with ICCs ≥ 0.70 extracted in the second round were selected for further analysis. According to the LASSO results (Fig. 2), we obtained 4 and 3 radiomic features with nonzero coefficients for PFS and OS, respectively, which were used to calculate the R-signaturePFS and the R-signatureOS. The results of the ICC analysis and the formulas for the R-signatures can be found in the Supplementary Materials.
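The repeatability filter can be sketched as follows. The text does not state which form of the ICC was used, so the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), is an assumption here, as are the feature names and values. Each row is one lesion and each column is one of the two observers.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` holds one row per lesion and one column per observer."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)


# Hypothetical feature values from two observers (illustrative only).
measurements = {
    "feature_a": [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],              # identical readings
    "feature_b": [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]],  # constant offset
    "feature_c": [[1.0, 3.0], [2.0, 1.0], [3.0, 2.0]],              # poor agreement
}
iccs = {name: icc_2_1(values) for name, values in measurements.items()}
selected = [name for name, value in iccs.items() if value >= 0.70]
print(iccs, selected)
```

In this toy data the constant offset between observers lowers the absolute-agreement ICC of feature_b to 10/13 (about 0.77), so it passes the 0.70 filter but would not count as "good" under the 0.85 threshold.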

Fig. 2 Feature selection using the LASSO model. Tuning parameter (λ) selection in the LASSO model used tenfold cross-validation with the minimum criterion for (a) PFS and (b) OS; coefficient profiles of the radiomic features for (c) PFS and (d) OS

The ROC-AUCs of the R-signaturePFS were 0.788 (95% CI = 0.682–0.895) and 0.473 (p = 0.803) in the two cohorts. When analysing the association of the R-signaturePFS with PFS, the results of the log-rank test indicated significant discrimination between the high- and low-risk groups in the training cohort (Fig. 3a), but no discrimination in the validation cohort (Fig. 3c). The ROC-AUCs of the R-signatureOS were 0.637 (95% CI = 0.488–0.786) and 0.730 (95% CI = 0.548–0.912) in the two cohorts. We also observed that the R-signatureOS was significant for classifying the patients into high- and low-risk groups in the two cohorts (Fig. 3b, d).
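The comparison of high- and low-risk groups rests on the log-rank test, which can be sketched for two groups as below. The data are invented and the function name is an assumption; the p value uses the identity that a chi-square variable with one degree of freedom has upper tail probability erfc(sqrt(x/2)).

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test; `groups` holds 0 (low risk) or 1 (high risk)."""
    observed_minus_expected = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # patients still under observation just before time t
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n = len(at_risk)
        n1 = sum(at_risk)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


# Invented data: the high-risk group (1) fails early, the low-risk group late.
times = [2, 4, 6, 8, 10, 12, 14, 16]
events = [1, 1, 1, 1, 0, 1, 0, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]
chi2, p_value = logrank_test(times, events, groups)
print(round(chi2, 2), round(p_value, 4))
```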

Fig. 3 Kaplan-Meier survival curves: (a) R-signaturePFS, training cohort; (b) R-signatureOS, training cohort; (c) R-signaturePFS, validation cohort; and (d) R-signatureOS, validation cohort

Building of the radiomics-based model

The results of the univariate and multivariate analyses are listed in Tables 2 and 3. In the univariate analysis, the Eastern Cooperative Oncology Group performance status (ECOG PS), Ann Arbor stage, lactate dehydrogenase (LDH), International Prognostic Index (IPI), SUVmax, MTV, TLG, and the R-signaturePFS were associated with PFS; bone marrow (BM), ECOG PS, Ann Arbor stage, LDH, IPI, SUVmax, MTV, TLG, and the R-signatureOS were associated with OS.

Table 2 The results of the univariate Cox regression analysis
Table 3 The results of the multivariate Cox regression analysis

For PFS, the R-signaturePFS and significant clinical variables in the univariate analysis were selected for inclusion in the multivariate Cox regression. The R-signaturePFS and IPI remained as prognostic factors in the multivariate analysis and were used to build the radiomics-based model (Fig. 4a). For OS, the radiomics-based model was built using the R-signatureOS and ECOG PS (Fig. 4b), which were independent prognostic factors of OS identified by the multivariate analysis.

Fig. 4 The nomograms of the radiomics-based models (a) for PFS and (b) for OS

Building of the metabolism-based model

For PFS, the significant clinical variables and metabolic parameters in the univariate analysis were entered into the multivariate Cox regression, and MTV and IPI were identified as independent prognostic factors and were used to build the metabolism-based model. Likewise, the metabolism-based model for OS prediction was built using the SUVmax, MTV, and ECOG PS.

Validation and comparison of the models

The results of the C-index and the Hosmer-Lemeshow test are shown in Table 4. The calibration curves of the models are shown in the Supplementary Materials. For PFS prediction, the radiomics-based model showed better discrimination than the metabolism-based model in the training cohort (C-index = 0.811 vs. 0.751) but poorer discrimination in the validation cohort (C-index = 0.588 vs. 0.693). The Hosmer-Lemeshow test showed that the calibration of the radiomics-based model was poorer than that of the metabolism-based model (training cohort: p = 0.415 vs. 0.428; validation cohort: p = 0.228 vs. 0.652). The calibration curves showed that the calibration of the two models was better in the training cohort than that in the validation cohort (Supplementary Figs. S1 and S2).

Table 4 The results of the C-index and the Hosmer-Lemeshow test

For OS prediction, the discrimination of the radiomics-based model was poorer than that of the metabolism-based model in the training cohort (C-index = 0.818 vs. 0.828) and the validation cohort (C-index = 0.628 vs. 0.753). The Hosmer-Lemeshow test indicated that the calibration of the radiomics-based model was poorer than that of the metabolism-based model (training cohort: p = 0.853 vs. 0.885; validation cohort: p < 0.05 vs. 0.913). According to the calibration curves, the calibration of the two models was better in the training cohort than that in the validation cohort (Supplementary Figs. S3 and S4).

Discussion

In the present study, we developed R-signatures with at best moderate predictive ability in a training cohort and a validation cohort (R-signaturePFS: AUC = 0.788 and 0.473, respectively; R-signatureOS: AUC = 0.637 and 0.730, respectively). Although no significant association was found between the R-signaturePFS and PFS in the validation cohort, the R-signatures were associated with PFS in the training cohort and with OS in both cohorts. These results suggest that radiomic features extracted from pretreatment 18F-FDG PET images carry prognostic information for lymphoma outcomes. We further developed radiomics-based models combining R-signatures and clinical variables to predict PFS and OS among patients with ENKTL, and the models were then visualised via nomograms with the aim of identifying patients at a high risk of early progression and death who could be offered an alternative treatment strategy. The radiomics-based models achieved good discrimination in the training cohort but weaker performance in the validation cohort (PFS: C-index = 0.811 and 0.588, p = 0.415 and 0.228; OS: C-index = 0.818 and 0.628, p = 0.853 and < 0.05). However, apart from PFS discrimination in the training cohort, the performance of the radiomics-based models was inferior to that of the metabolism-based models.

The main goal of radiomics is to build a prediction model for clinical outcomes using selected radiomic features, and integrating radiomic features with traditional prognostic indicators (clinical indicators) in one model can improve the prediction performance of a single prognostic indicator [8, 9, 31, 32]. Traditional PET metabolic parameters have been confirmed to be significant prognostic indicators for outcomes of patients with ENKTL and are widely used in clinical management [4, 5]. However, most radiomic indices extracted from PET images have been considered to be significantly correlated with metabolic parameters, especially MTV and TLG [33, 34], and such correlations are postulated to be inherent to the definitions of the features as opposed to being variable depending on the tumour type or acquisition/reconstruction protocols [11], which were also reported to differ in susceptibility to methodological, biological, and metabolic features [35, 36]. Based on these considerations, we built radiomics-based models to investigate the potential added prognostic value of radiomic features in ENKTL patients by comparing the models with metabolism-based models. Our results suggested that the radiomics-based models may provide limited prognostic information for ENKTL patients compared with the metabolism-based model.

Our findings are in line with those of recent studies indicating that the performance of PET-derived radiomic features for tumour prognosis prediction is poor compared with that of PET-derived metabolic parameters. In a cohort of 82 patients with aggressive B cell lymphoma, MTV was correlated with the response to therapy, but texture features could not predict the therapy response, although several features were correlated with the presence of a residual mass and outcomes [18]. Rogasch et al. reported that the asphericity feature could predict the response after chemotherapy in 50 paediatric patients with Hodgkin’s lymphoma (HL), although MTV showed a better performance [35]. In contrast, several studies have reported promising results for the utilisation of radiomic features derived from PET to risk-stratify patients with lymphoma. Lue et al. found that the intensity nonuniformity of pretreatment PET was a prognostic indicator in 42 patients with HL and may outperform MTV [31]. In a cohort of 17 patients with ENKTL, texture features (dissimilarity and LISZE) extracted from pretreatment PET images were independent predictors of PFS, whereas the SUVmax, MTV, and TLG were not associated with PFS [37]. Wang et al. demonstrated a relationship between PET radiomic features and OS in 19 patients with renal/adrenal lymphoma, and MTV was not an independent factor [38].

These differences may be attributed to many sources. First, as mentioned above, several studies incorporated radiomic and metabolic features into one model, which may result in an underlying risk of redundancy and underestimation of the performance of both types of features. Second, a sample size of at least 10 to 15 patients per predictor variable has been proposed to be required to produce valid estimates for multiple regression models [39]. We reduced the number of radiomic features to 4 and 3, which is reasonable for minimising false detection rates. Previous studies with small sample sizes and datasets may have a risk of bias. Additionally, many variables may affect the stability and prognostic value of PET-based radiomic features, such as the scanner, segmentation, reconstruction parameters, formulas used to define the radiomic features, and software [40, 41]. Heterogeneity among different studies may have contributed to controversial results. Lastly, validation analysis has been regarded as an indispensable step in radiomics research to show the potential value of a radiomics model for clinical application [8, 42]. In this study, the performance of the models in the internal validation cohort was inferior to that in the training cohort, suggesting that the stability of the models should be considered with caution. Comparing previous findings directly is difficult considering the limited number of lymphoma radiomics studies that have performed internal and/or external validation.

Segmentation is an important step in radiomics research as edges can substantially affect feature values. No consensus on optimal segmentation is available for lymphoma because lymphoma lesions usually have heterogeneous sizes, shapes, and locations. On PET images, normal organs with high uptake, as well as physiological FDG uptake and excretion, may cause confusion [43]. Therefore, identifying accurate and robust segmentation methods for lymphoma is important. We defined ROIs using a semiautomatic threshold-based method (the 40% threshold segmentation method) and assessed the interobserver repeatability of the segmentation. The ICC analysis indicated that 27 of 38 radiomic features had good repeatability (ICC ≥ 0.85), which is consistent with previous studies demonstrating that most PET image features exhibited high stability in test-retest and interobserver analyses [40, 44]. Several different segmentation methods have been devised for lymphoma [43, 45, 46]. Hu et al. proposed an entropy-based optimisation strategy to detect and segment lymphoma in PET images and reported a good performance [43]. Hu et al. proposed an automatic approach for ENKTL segmentation that was more stable than traditional deep-learning segmentation [46].

The limitations of the present study are as follows. First, this was a single-centre, retrospective study, and external validation was not performed, which may have impacted patient selection, the examination protocol, and the radiomic quantification results. Second, we used only 18F-FDG PET images to extract radiomic features. PET images have relatively low spatial resolution and high noise, which may influence lesion identification [9]. A combination of PET and CT images may expand the feature pool and lead to the discovery of more predictive radiomic features. In addition, we extracted only traditional radiomic features, which may contain limited information. Other types of features (model- or transform-based features and deep-learning features) have gained popularity because they are more specific to data and clinical outcomes [47], and such features should be further investigated to better understand the predictive power of radiomics.

Conclusions

In conclusion, a pretreatment 18F-FDG PET radiomics-based model was designed and showed significant stratification power in predicting PFS and OS in ENKTL, but the performance was inferior to that of the metabolism-based model. Therefore, further multi-centre, prospective studies with external validation are required to determine whether the results are reproducible or require refinement, and to achieve a higher level of evidence.