Introduction

Glioma is the most common primary brain tumor, accounting for approximately 51% of all central nervous system tumors [1]. Lower grade gliomas (LGGs) are WHO grade II and III tumors; they show heterogeneous survival outcomes, with some LGGs exhibiting high therapeutic sensitivity [2]. Thus, prior studies have investigated important clinical predictors of survival for patients with LGG, such as age and extent of tumor resection [3, 4]. The importance of molecular biomarkers, such as isocitrate dehydrogenase (IDH) mutational status, in survival prediction for LGG has been recognized [5]; IDH mutation is associated with better survival outcomes for patients with LGG [5,6,7]. Additionally, imaging biomarkers, such as enhancement pattern, have been investigated for survival prediction [8]. However, most imaging parameters in prior studies were heterogeneous and non-quantitative [8,9,10]. Here, we attempted to extract extensive quantitative information by radiomics analysis and to apply it to survival prediction using machine learning tools.

Radiomics is based on the conversion of medical images to high-dimensional quantitative information that reflects underlying pathophysiology, especially tumor heterogeneity [11]. Radiomics has recently gathered attention for its role in predicting genetic status and survival in patients with brain tumors [12,13,14]. However, many prior studies have focused on predicting survival or genetic alterations of LGG, such as IDH status, by radiomics alone, and some have not performed external validation. In practice, decision-making for LGG patients depends on multi-level prognostic information, including IDH status and clinical information such as age or extent of tumor resection; therefore, we focused on the role of radiomics in predicting survival within such multi-level prognostic information. We also conducted external validation using the TCGA/TCIA (The Cancer Genome Atlas/The Cancer Imaging Archive) dataset to assess the generalizability of radiomic phenotyping. The aim of this study was to evaluate whether MRI-based radiomic features could improve the accuracy of survival predictions for LGG over clinical profile and IDH status.

Materials and methods

This retrospective study was conducted with approval from our institutional review board (IRB). The need for acquiring informed consent was waived. The National Institutes of Health/National Cancer Institute–approved TCGA and TCIA databases contain publicly available datasets in which all data are anonymized [2, 15]. Therefore, IRB approval from individual institutions was not required to use information from the TCGA and TCIA databases in the present study.

Study cohort

A flow chart of patient enrollment is shown in Fig. 1. Two cohorts were enrolled, i.e., an institutional cohort and a TCGA/TCIA cohort (hereinafter referred to as the “TCGA cohort”). In the institutional cohort, 232 patients were pathologically diagnosed with WHO grade II and III gliomas and underwent preoperative MRI between September 2007 and January 2017. The inclusion criteria were as follows: (1) histopathologically confirmed WHO grade II or III glioma; (2) known IDH mutation status; (3) no prior brain biopsy, surgery, or other treatments; (4) age ≥ 18 years; and (5) availability of preoperative T2-weighted (T2), fluid-attenuated inversion recovery (FLAIR), and T1-weighted contrast-enhanced (T1C) images. Based on these criteria, the following were excluded: (1) unknown IDH mutation status (n = 5); (2) previous history of biopsy or operation of brain tumor (n = 3); (3) absence of FLAIR or T1C images (n = 16); and (4) error in image processing (n = 3).

Fig. 1 Patient enrollment processes for the institutional (a) and TCGA (b) cohorts. IDH, isocitrate dehydrogenase; FLAIR, fluid-attenuated inversion recovery images; LGG, lower grade gliomas; T1C, T1-weighted contrast-enhanced images; T2, T2-weighted images; TCGA, The Cancer Genome Atlas; TCIA, The Cancer Imaging Archive

For the TCGA cohort, the medical records of 199 patients with available MRI data from the LGG dataset of the TCGA/TCIA database [2] were retrospectively reviewed. The inclusion criteria were as follows: (1) pathologically confirmed WHO grade II or III glioma; (2) known clinical information, including age, IDH mutation status, and extent of tumor resection; (3) no prior brain biopsy, surgery, or other treatments; and (4) availability of preoperative T2, FLAIR, and T1C images. Based on these criteria, the following were excluded: (1) absence of FLAIR, T2, or T1C images (n = 80); (2) previous history of biopsy or surgery (n = 17); (3) unavailability of information on extent of tumor resection (n = 8); and (4) inadequate image quality (n = 3). Thus, a total of 205 and 91 patients were selected for the institutional and TCGA cohorts, respectively.
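The enrollment arithmetic for both cohorts can be checked with a short sketch. It assumes that the exclusion categories are mutually exclusive (each excluded patient is counted once), which the text implies but does not state; the variable and function names are illustrative only.

```python
# Screened and excluded counts as reported in the Study cohort section.
# Assumption: each excluded patient falls into exactly one exclusion category.
cohorts = {
    "institutional": {"screened": 232, "excluded": [5, 3, 16, 3]},
    "TCGA": {"screened": 199, "excluded": [80, 17, 8, 3]},
}


def enrolled(screened, excluded):
    """Patients remaining after subtracting every exclusion count."""
    return screened - sum(excluded)


final_counts = {name: enrolled(**c) for name, c in cohorts.items()}
print(final_counts)
```

Under that assumption, the counts agree with the 205 and 91 patients reported for the institutional and TCGA cohorts.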

IDH mutation status and clinical information

In the institutional cohort, the IDH R132H mutation was detected using peptide nucleic acid–mediated clamping polymerase chain reaction and immunohistochemical analysis. Along with IDH status, the WHO grade and clinical information including age, sex, resection extent, and overall survival (OS) were recorded. Resection extent was categorized as total (gross tumor removal of 100%), subtotal (gross tumor removal ≥ 75% but < 100%), or partial (gross tumor removal < 75%) or biopsy, based on postoperative MRI findings. Details on IDH mutation status and clinical information are available in Table 1.

Table 1 Clinical characteristics of the patients

For the TCGA cohort, histological findings and grading of tissues submitted to TCGA were confirmed by neuropathology review, as previously described [2, 15]. The clinical information, WHO grade, and IDH status for the TCGA cohort were recorded, similar to the institutional cohort.

Image acquisition and postprocessing

The image processing workflow is shown in Fig. 2. MR images of the institutional cohort were acquired using a 3.0-T MRI (Achieva, Philips) and an 8-channel SENSE head coil.

Fig. 2 Workflow of external validation of radiomics models for predicting OS. GLCM, gray-level co-occurrence matrix; GLRLM, gray-level run-length matrix; GLSZM, gray-level size zone matrix; iAUC, integrated area under the time-dependent ROC curve; IDH, isocitrate dehydrogenase; OS, overall survival; ROC, receiver operating characteristic

The preoperative MRI protocol included acquisition of T1-weighted (repetition time [TR], 2000 ms; echo time [TE], 10 ms; field of view [FOV], 240 mm; slice thickness, 5 mm; and matrix size, 256 × 256); T2 (TR, 3000 ms; TE, 80 ms; FOV, 240 mm; slice thickness, 5 mm; and matrix size, 256 × 256); and FLAIR (TR, 10,000 ms; TE, 125 ms; FOV, 240 mm; slice thickness, 5 mm; and matrix size, 256 × 256) images. Additionally, 3-D T1C images (TR, 6.3 ms; TE, 3.1 ms) were acquired after the administration of a gadolinium-based contrast agent (0.1 ml/kg gadobutrol, Gadovist, Bayer Schering Pharma).

Because the imaging protocols of the TCGA cohort were heterogeneous and differed from the institutional protocol, including in spatial resolution, T2 images from the institutional and TCGA cohorts were resampled to a 2-D spatial resolution of 1 × 1 × 5 mm. FLAIR and T1C images were registered to the resampled T2 images after skull stripping, using the FMRIB software library (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FSL). After image registration, image signal intensity was normalized using the WhiteStripe R package [16]. Regions of interest (ROIs) were drawn to segment the entire tumor (defined as the hyperintense area on FLAIR images), including edema, using semiautomatic methods from an open-source application (Medical Image Processing, Analysis and Visualization, available at https://mipav.cit.nih). The ROIs were drawn by a neuroradiologist (Y.S.C., with 9 years of experience) and confirmed by an independent reviewer (S.S.A., with 12 years of experience). A total of 250 radiomic features, including shape (n = 16), first-order (n = 19), gray-level co-occurrence matrix (GLCM; n = 27), gray-level run-length matrix (GLRLM; n = 16), and gray-level size zone matrix (GLSZM; n = 16) features, were extracted from ROIs on T1C, T2, and FLAIR images using a Python-based open-source module (Pyradiomics 1.2.0, http://www.radiomics.io/pyradiomics.html), with the following parameter settings: “binWidth”, 1; “force2D”, True [17]. Details of the radiomic features are available in Supplementary Material S1 and at https://pyradiomics.readthedocs.io/en/latest/features.html.
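The total of 250 features can be reconstructed from the per-class counts given above. The following sketch assumes that the 16 shape features are counted once (the same ROI is shared across sequences) and that the first-order and texture classes are extracted separately from each of the three sequences; this breakdown is an inference from the reported numbers, not a statement from the text.

```python
# Per-class feature counts as reported in the text.
SEQUENCES = ["T1C", "T2", "FLAIR"]
SHAPE_FEATURES = 16  # assumption: counted once, because the ROI is shared
PER_SEQUENCE = {"firstorder": 19, "glcm": 27, "glrlm": 16, "glszm": 16}


def total_features():
    """Shape features once, plus intensity-based classes per sequence."""
    per_sequence_total = sum(PER_SEQUENCE.values())  # 78 per sequence
    return SHAPE_FEATURES + len(SEQUENCES) * per_sequence_total


print(total_features())
```

With these assumptions, the sum matches the 250 features reported.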

Statistical analysis and machine learning

Models to predict OS based on radiomic features, clinical profile, and IDH mutation status were trained using the random survival forest (RSF) method. RSF is an extension of random forest intended for the analysis of right-censored survival data; it uses the log-rank test for node splitting instead of the impurity-based criteria (such as the Gini index) used in standard random forest [18]. We chose RSF over commonly used Cox regression–based methods for survival analysis because RSF does not rely on restrictive assumptions such as proportional hazards, and automatically handles non-linear effects and interactions of hundreds of radiomic features [18]. The feature selection and training of RSF models to predict OS were conducted completely within the institutional data; the TCGA/TCIA data were used for external validation only. The overall workflow of RSF machine learning and other statistical analyses is shown in Fig. 2. All statistical analyses were performed using R (version 3.2.4; R Foundation for Statistical Computing). A p value (p) < 0.05 was considered statistically significant.

To evaluate whether radiomic phenotyping improves survival prediction over models established from non-imaging prognostic factors, three RSF models were trained, as follows: (1) a radiomics RSF model based on radiomic features (model 1); (2) a clinical RSF model based on the non-imaging prognostic factors consisting of age, resection extent, WHO grade, and IDH status (model 2); and (3) a combined RSF model based on radiomic features and the non-imaging factors (model 3). For training of the radiomics RSF model (model 1), the 250 radiomic features were subjected to two steps of feature selection using the univariate log-rank test and minimal depth, as follows. First, patients were dichotomized on the basis of each radiomic feature, using the optimal cutoff that yielded the most significant survival difference, as calculated by the “cutp” function of the “survMisc” R package. Then, the OS of the dichotomized groups of the institutional cohort was compared using the log-rank test. Only those features with p values < 0.05 in the log-rank test were considered for the second step of feature selection using minimal depth. Minimal depth is a dimensionless statistic that measures the predictive ability of a variable in a survival tree of RSF, and has been shown to be useful for selecting relevant features in high-dimensional problems using RSF [19]. The radiomic features selected by the univariate log-rank test and minimal depth were used to train the radiomics RSF model (model 1). The minimal depth–based feature selection and RSF model training were performed using the “var.select” and “rfsrc” functions of the “randomForestSRC” R package, respectively. For training of the clinical RSF model (model 2), all the non-imaging prognostic factors consisting of age, resection extent, WHO grade, and IDH status were used without feature selection. Age was considered a continuous variable, while resection extent, WHO grade, and IDH status were considered categorical variables. For the combined RSF model (model 3), the selected radiomic features from model 1 and the non-imaging factors from model 2 were subjected to further feature selection based on minimal depth; only the radiomic features from model 1 were candidates for removal in this step, and the non-imaging factors from model 2 were always retained in the final version of model 3. This process was intended to remove radiomic features that were unhelpful in the presence of the non-imaging factors and to reduce the total number of features in model 3, while retaining the well-known non-imaging prognostic factors in the final model. As RSF model training uses intrinsic randomization for selecting bagging samples and features, and estimates prediction error from out-of-bag samples [18], cross-validation was not separately conducted for RSF model training.
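The first feature selection step (dichotomization at an optimal cutoff followed by a log-rank test) can be sketched as follows. This is a minimal pure-Python illustration, not the “cutp” implementation from “survMisc”: the function names, the `min_group` constraint, and the chi-square approximation for the p value are all assumptions introduced here.

```python
import math


def logrank_chi2(times, events, high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    observed = expected = variance = 0.0
    for t in sorted({ti for ti, ev in zip(times, events) if ev}):
        n = sum(1 for ti in times if ti >= t)                       # at risk, both groups
        n1 = sum(1 for ti, h in zip(times, high) if ti >= t and h)  # at risk, high group
        d = sum(1 for ti, ev in zip(times, events) if ev and ti == t)
        d1 = sum(1 for ti, ev, h in zip(times, events, high)
                 if ev and ti == t and h)
        observed += d1
        expected += d * n1 / n
        if n > 1:  # hypergeometric variance is undefined for a single subject
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (observed - expected) ** 2 / variance if variance > 0 else 0.0


def chi2_p_value(chi2):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def best_cutoff(values, times, events, min_group=2):
    """Scan candidate cutoffs; return (cutoff, chi2, p) with the largest chi2."""
    best = None
    for cut in sorted(set(values))[:-1]:
        high = [v > cut for v in values]
        n_high = sum(high)
        if min(n_high, len(high) - n_high) < min_group:
            continue  # hypothetical guard against very small groups
        chi2 = logrank_chi2(times, events, high)
        if best is None or chi2 > best[1]:
            best = (cut, chi2, chi2_p_value(chi2))
    return best


def univariate_filter(features, times, events, alpha=0.05):
    """Keep feature names whose best-cutoff log-rank p value is below alpha."""
    kept = []
    for name, values in features.items():
        result = best_cutoff(values, times, events)
        if result is not None and result[2] < alpha:
            kept.append(name)
    return kept


# Toy data (not from the study): a high "informative" value means early death.
times = [80, 70, 60, 50, 4, 3, 2, 1]
events = [1] * 8
features = {
    "informative": [1, 2, 3, 4, 5, 6, 7, 8],
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],
}
print(univariate_filter(features, times, events))
```

In this toy example only the informative feature passes the filter; the surviving features would then go on to the minimal depth step, which requires a fitted RSF and is not reproduced here.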

After feature selection and RSF training of models 1–3 within the institutional cohort, models 1–3 were externally validated on the TCGA cohort. To measure the performance of each RSF model on external validation, the predicted OS risks of models 1–3 in the TCGA cohort were entered into time-dependent receiver operating characteristic (ROC) analysis, and the integrated area under the ROC curve (iAUC) was calculated. To evaluate the incremental value of radiomic features in OS prediction, iAUC differences between model 2 (clinical RSF model) and model 3 (combined RSF model with radiomic features and non-imaging prognostic factors) were calculated with 1000 iterations of bootstrapped resampling. The iAUC difference was considered statistically significant if its 95% confidence interval (95% CI) did not contain zero. Instead of a single p value, the iAUC differences and their 95% CIs are provided, because a bootstrapped CI is preferred whenever there is cause to doubt the assumptions underlying parametric CIs, as for iAUC values [20, 21], and because interval estimation may be more informative than a single p value [22].
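The bootstrap comparison of two models can be illustrated with a short sketch. Because a time-dependent iAUC needs more machinery than fits here, Harrell's concordance index is used as a stand-in discrimination metric; that substitution, the percentile method for the interval, and all function names are assumptions of this sketch rather than the procedure used in the study.

```python
import random


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs in which the earlier event has higher risk."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def bootstrap_difference(times, events, risks_a, risks_b, n_boot=1000, seed=0):
    """Mean and 95% percentile CI of C(model b) - C(model a) over paired resamples."""
    rng = random.Random(seed)
    n = len(times)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        t = [times[i] for i in idx]
        e = [events[i] for i in idx]
        c_a = concordance_index(t, e, [risks_a[i] for i in idx])
        c_b = concordance_index(t, e, [risks_b[i] for i in idx])
        if c_a == c_a and c_b == c_b:  # skip resamples without comparable pairs (NaN)
            diffs.append(c_b - c_a)
    diffs.sort()
    lower = diffs[int(0.025 * len(diffs))]
    upper = diffs[int(0.975 * len(diffs)) - 1]
    return sum(diffs) / len(diffs), (lower, upper)


# Toy data (not from the study): model b ranks patients perfectly, model a is uninformative.
toy_times = list(range(1, 21))
toy_events = [1] * 20
toy_risks_a = [0.0] * 20
toy_risks_b = [-float(t) for t in toy_times]
mean_diff, (ci_low, ci_high) = bootstrap_difference(
    toy_times, toy_events, toy_risks_a, toy_risks_b, n_boot=200
)
print(mean_diff, ci_low, ci_high)
```

Both models are evaluated on the same resampled patients in every iteration, so the interval reflects the paired difference. Note that, in general, the mean of the bootstrapped differences need not equal the difference between the two full-sample estimates.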

In addition, to evaluate whether the radiomics model (model 1) stratifies OS, patients were dichotomized on the basis of the optimal cutoff of the predicted risk in model 1, and OS was compared between the two dichotomized groups using the log-rank test. This optimal cutoff was chosen to yield the most significant OS difference between the dichotomized groups and was calculated using the “cutp” function of the “survMisc” R package.

Results

Population characteristics

The clinical characteristics of the study cohort are summarized in Table 1. In the institutional cohort (total, n = 205; IDH-mutant gliomas, n = 136; IDH wild-type gliomas, n = 69), the mean age of patients was 44.6 ± 12.7 years, and the male-to-female ratio was 108:97. In TCGA cohort (total, n = 91; IDH-mutant gliomas, n = 71; IDH wild-type gliomas, n = 20), the mean age of patients was 46.8 ± 13.6 years, and the male-to-female ratio was 43:48. Patients from the institutional cohort showed significantly longer OS (median OS duration, 2986 vs. 1859 days; p = 0.008), a non-significant trend towards a lower proportion of IDH mutation (66.3% vs. 78.0%; p = 0.059), and a higher proportion of WHO grade II gliomas (59.5% vs. 46.2%; p = 0.045) compared with the TCGA cohort. The male-to-female ratio and resection extent were not significantly different between the institutional and TCGA cohorts (p = 0.389 and 0.189, respectively).
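The reported IDH mutation proportions follow directly from the group counts, as the sketch below shows. It also computes a Yates-corrected chi-square test on the same 2 × 2 table; the statistical test used for the cohort comparison is not stated in this section, so that choice is an assumption made only for illustration.

```python
import math


def yates_chi2_2x2(a, b, c, d):
    """Yates-corrected chi-square test for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square with 1 degree of freedom
    return chi2, p_value


# Counts as reported: IDH-mutant and IDH wild-type gliomas per cohort.
inst_mut, inst_wt = 136, 69
tcga_mut, tcga_wt = 71, 20

pct_inst = round(100 * inst_mut / (inst_mut + inst_wt), 1)
pct_tcga = round(100 * tcga_mut / (tcga_mut + tcga_wt), 1)
chi2, p_value = yates_chi2_2x2(inst_mut, inst_wt, tcga_mut, tcga_wt)
print(pct_inst, pct_tcga, round(chi2, 2), round(p_value, 3))
```

The proportions reproduce the reported 66.3% and 78.0%, and under the assumed test the p value falls close to the reported 0.059.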

Improved survival prediction by radiomic phenotyping

For the radiomics model (model 1), 71 radiomic features were selected for predicting OS (Fig. 3 and Supplementary Material S2). The radiomics model (model 1) was successfully validated on the TCGA cohort with an iAUC of 0.620 (95% CI, 0.501–0.756). The predicted risk from model 1 significantly stratified OS in the TCGA cohort (HR, 4.08; 95% CI, 1.40–11.91; log-rank p = 0.007), as well as in the institutional cohort (HR, 11.29; 95% CI, 4.44–28.67; log-rank p < 0.001; Fig. 3). The 71 selected radiomic features consisted of 9 shape features, and 25, 26, and 11 first-order or texture (i.e., GLCM, GLRLM, and GLSZM) features from T1C, T2, and FLAIR, respectively, among which the top 4 most relevant features consisted of shape features. The clinical model (model 2) with non-imaging prognostic factors yielded an iAUC of 0.627 (95% CI, 0.529–0.745) on validation on the TCGA cohort.

Fig. 3 Kaplan–Meier curves (a, b) and radiomic feature heat map (c) of the low-risk and high-risk groups stratified by a radiomics RSF model for predicting OS, and time-dependent ROC curve analysis (d) to show the incremental value of radiomic phenotyping over clinical profile and IDH status in the prediction of OS. IDH, isocitrate dehydrogenase; OS, overall survival; ROC, receiver operating characteristic; RSF, random survival forest

When radiomic features from model 1 were integrated into the clinical model (model 2), the newly established combined model 3 showed significantly improved OS prediction compared with model 2 (Table 2 and Fig. 3; iAUC, 0.627 vs. 0.709; difference, 0.097; 95% CI, 0.003–0.209). In the further feature selection for model 3, 19 out of 71 radiomic features from model 1 were retained in the presence of the non-imaging prognostic factors. The 19 selected radiomic features consisted of 6 shape features, and 9, 3, and 1 first-order or texture features from T1C, T2, and FLAIR, respectively (Supplementary Material S3). Similar to model 1, the top 4 most relevant of the 19 radiomic features were shape features. With regard to the non-imaging prognostic factors constituting model 3, IDH status was the most important feature, and age, resection extent, and WHO grade were the 8th, 16th, and 17th most important features among the total of 23 features in model 3, respectively.
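The composition of the selected feature sets can be tallied as a quick consistency check. The dictionary layout and names below are illustrative; the counts are those reported above.

```python
# Selected radiomic features per source, as reported for models 1 and 3.
model1_radiomic = {"shape": 9, "T1C": 25, "T2": 26, "FLAIR": 11}
model3_radiomic = {"shape": 6, "T1C": 9, "T2": 3, "FLAIR": 1}
NON_IMAGING = ["age", "resection extent", "WHO grade", "IDH status"]

n_model1 = sum(model1_radiomic.values())
n_model3_radiomic = sum(model3_radiomic.values())
n_model3_total = n_model3_radiomic + len(NON_IMAGING)

print(n_model1, n_model3_radiomic, n_model3_total)
```

The totals agree with the 71 features in model 1, the 19 radiomic features retained in model 3, and the 23 features overall in model 3.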

Table 2 The iAUCs to measure performances of RSF models in OS prediction

Discussion

We investigated the role of radiomic phenotyping in improving survival prediction for patients with LGG in the presence of non-imaging prognostic factors from molecular and clinical profiles. We performed external validation to test model generalizability, and found that the accuracy of OS prediction was significantly improved by integrating radiomic features with the non-imaging prognostic factors.

Radiomic phenotyping has shown the potential to predict the survival outcome of patients with gliomas [13, 23,24,25]. However, many previous studies were based on a single institutional cohort [13, 23, 24]. As radiomic features are likely sensitive to differences in MRI protocols between institutions, our study was conducted with external validation to assess the generalizability and robustness of the radiomic phenotyping models across heterogeneous MRI protocols and clinical compositions of the patient population. Clinically, many patients with LGG undergo surgery, and survival outcome is predicted based on various non-imaging prognostic factors that are obtained before and after surgery, such as IDH mutation status, age, WHO grade, and extent of tumor resection. IDH mutation is the most important prognostic biomarker in gliomas; it was recently integrated into the 2016 WHO guidelines for the classification of glioma [26], and is associated with better prognosis and response to chemoradiotherapy than IDH wild-type status in gliomas of the same histopathological grade [27]. Therefore, IDH status and other non-imaging prognostic information should be considered when investigating the role of imaging biomarkers in survival prediction. Previous studies [24, 28] reported that radiomic features extracted from preoperative MRI can help improve survival prediction beyond molecular and clinical profiles in patients with glioblastoma. However, the role of radiomic features in LGG in consideration of non-imaging prognostic factors has not yet been well established; one study [25] established and externally validated a nomogram built upon clinicopathologic risk factors and a radiomic risk score from T2 images to predict progression-free survival of LGG patients. However, the addition of the radiomic risk score to the nomogram yielded a relatively small improvement in survival prediction (C-index, 0.668 vs. 0.684 and 0.815 vs. 0.823 for the nomograms before and after the addition of the radiomic risk score to the clinicopathologic risk factors, on the training and validation sets, respectively), and the significance of this improvement was not tested in that study [25]. In contrast, our study focused on the incremental value of radiomic features over non-imaging prognostic factors, and allowed flexibility in radiomic feature selection in consideration of the non-imaging factors for model 3.

In the feature selection for models 1 and 3 of our study, shape features were consistently selected as the most relevant features. This result is in line with previous reports of an association between tumor shape and prognosis [28,29,30]; one study [30] reported that tumor surface irregularity, which measures how much the tumor surface deviates from a sphere of the same volume, was a powerful predictor of survival outcome in glioblastoma patients. Another study [28] identified 18 radiomic features that predicted OS of glioblastoma patients, and 7 out of the 18 features were shape features. Aside from shape features, it is also noteworthy that all the texture and first-order features retained in model 3 were from either T1C or T2, except one feature from FLAIR. This implies that the FLAIR sequence may not be mandatory to achieve comparable performance from the combination of radiomic features and non-imaging factors. In addition, given that 19 out of 71 radiomic features from model 1 were retained in model 3, our results imply that a subset of features from radiomic phenotyping might be independent of the non-imaging prognostic information, thus providing added value for survival prediction. However, although better than random prediction, the performance of survival prediction by radiomic features alone (model 1) was not optimal, probably owing to differences in MRI protocols and in the clinical composition of the patient populations, including the different incidences of IDH mutation. In future investigations aimed at enhancing generalizability, the radiomic feature–based model should be trained on a more heterogeneous cohort that includes external cohort samples, or a prospective study could be conducted in which protocols are specified, thereby mitigating potential differences in imaging protocols among institutions. Like the model with radiomic features, the model with non-imaging prognostic factors also showed suboptimal performance for OS prediction. These results suggest that compromised model performance from limited generalizability across cohorts is a general issue for non-imaging prognostic factors as well as for radiomics. Nonetheless, our results indicate that the added prognostic value of radiomic features over non-imaging prognostic factors can be retained in an external cohort, despite the limited performance of the individual models.

Our study has several limitations. First, the number of clinical prognostic factors was limited in the TCGA dataset, so other important clinical factors such as Karnofsky performance status and postoperative treatment were not considered in the clinical RSF model; therefore, this clinical RSF model might not thoroughly reflect the performance of more comprehensive clinical factors in real-world clinical settings. Future investigations are necessary to confirm the added prognostic value of radiomic features when more comprehensive clinical prognostic information with uncompensated prognostic values is provided within a large and homogeneous cohort. Second, due to the heterogeneity of MRI protocols, a 2-D spatial resolution (i.e., 1 × 1 × 5 mm) was used for image resampling, which is below the resolution of 3-D isotropic T1C in current consensus recommendations [31]. This is because a considerable number of samples from the TCGA data had 2-D T1C images, and we reasoned that resampling TCGA images to a resolution higher than the original may synthesize non-informative pixels, which could negatively affect model performance on external validation with the TCGA set. The performance of the radiomics model may vary depending on the MRI protocol, especially the spatial resolution of the original images, which may limit the reproducibility of model performance. However, in our opinion, the strength of our study is that the potential of radiomic phenotyping for improving survival prediction was demonstrated by resolving this heterogeneity of the original images with postprocessing techniques.

Conclusions

We have shown that radiomic phenotyping using machine learning tools improves survival prediction in LGGs over clinical profile and IDH status. Our validation on a multi-institute TCGA cohort suggests that analytic tools developed using machine learning and radiomics analysis could be employed across different scanners and institutes.