Introduction

Glioblastoma (GBM) is one of the most common tumors in the Central Nervous System (CNS), accounting for nearly one-half of malignant brain tumors [1]. For almost twenty years, the standard treatment for GBM has remained concomitant radiation and temozolomide [2,3,4]. This standard, along with several failed clinical trials, has led to only minor improvements in overall survival (OS) for GBM over the past two decades, with a median survival of 12–15 months [3, 5, 6]. The failure of GBM clinical trials may reflect the complex genetic heterogeneity of these tumors, with several genetic alterations and subtypes [7]. New, targeted therapies and immunotherapies have great promise for improving survival in GBM [8]. One of the challenges in GBM clinical trials is the development of tools to pre-operatively stratify patients by survival or genetic alterations.

Radiomics is a semi-automated, high-throughput method that correlates quantitative imaging features with outcomes, including tumor pathology, genetics, recurrence, or patient survival [9,10,11,12,13]. One of the promising features of radiomics is the ability to use common diagnostic tools, like MRI, to classify patients noninvasively [14,15,16]. For GBM specifically, several studies have shown that radiomics can identify IDH mutations, novel genetic alterations that influence outcomes, and long-term survivors [12, 14, 17,18,19]. These tools could allow for improved pre-operative counseling for patients and families, better identification of patients who would benefit from tumor resection, and better guidance of clinical trials [20]. Despite this potential, only a few studies have evaluated radiomics in patient care settings [21, 22]. To ease the transition from a research tool to a clinical resource, we propose the development of a pre-operative patient report card that can identify common patient genetic alterations and predict OS. This tool can be used both to guide care at the individual patient level and to better stratify patients for clinical trials. We hypothesized that a radiomics-based tool could robustly evaluate overall survival, common genetic alterations, and genetic subtypes commonly described in the literature [7].

Materials and methods

Our study was approved by the University of Texas MD Anderson Cancer Center (MDACC) and University of Pittsburgh Medical Center (UPMC) institutional review boards. The requirement for informed consent was waived. We followed the “STAndards for Reporting Diagnostic Accuracy studies” (STARD) checklist [23]. For our study, we classified patients across four domains: EGFR amplification, MGMT promoter methylation, molecular subgroup (classical, proneural, mesenchymal) based on Verhaak et al. [7], and OS. For OS, we classified patients as surviving greater than (OS > 12) or less than (OS < 12) 12 months. Of note, for molecular subgroup, we did not classify patients in the neural group as this group has been subsequently found to be a background signature and not driven by GBM pathology [24].

Study cohort

We retrospectively identified a cohort of treatment-naïve, pathologically confirmed GBMs from MDACC from 2002 to 2019 and from TCGA [25]. The preoperative MRI studies of the TCGA patients are available for public download from The Cancer Imaging Archive (TCIA) (http://cancerimagingarchive.net/). Patient inclusion criteria were as follows: histopathologic confirmation of GBM; preoperative MRI sequences including contrast-enhanced axial T1-weighted imaging (ceT1WI) and pre-contrast axial T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR); and availability of clinical and genomic data. MDACC cohort patients underwent standard treatment of resection followed by the Stupp protocol with six weeks of concomitant temozolomide and radiation therapy. For all patients, we collected EGFR amplification status, MGMT promoter methylation, OS, and molecular subtype if available.

Image registration, segmentation, and pre-processing

We have previously described our image segmentation, processing, and radiomics pipeline [14, 15, 26]. In brief, we registered each MRI to the same geometric space using the registration toolbox in 3D Slicer. Then, we segmented the ceT1WI and FLAIR sequences for all patients. All TCGA and MDACC imaging studies were performed using clinical MRI scanners with field strengths ranging from 1 to 3 T. Acquisition parameters, including slice thickness, voxel size, and slice gap, are reported in Supplementary Tables 1 and 2. Tumor segmentation was performed semi-automatically using 3D Slicer version 4.3.1 (www.slicer.org), an open-source image analytics platform for image processing and segmentation [27,28,29]. We segmented four distinct imaging phenotypes: edema/invasion, contrast enhancement, necrosis, and whole tumor volume [14]. Additionally, a region of contralateral hemisphere normal-appearing white matter was segmented for within-sequence normalization. All segmented images were reviewed by consensus by two board-certified neuroradiologists with 9 (R.R.C.) and 35 (A.J.K.) years of experience.

After segmenting all scans, we used FMRIB’s Brain Extraction Tool (BET) (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/BET) to remove non-brain tissue. Using the ceT1WI, we generated a brain mask with BET and then applied it to the FLAIR sequences, thus removing non-brain tissue from those images as well. To account for scanner differences, we applied the Nyúl intensity normalization algorithm to standardize the intensity scales across MR images of the same contrast [30].
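The idea behind landmark-based intensity standardization can be sketched in a few lines. The following is a minimal illustration in the spirit of the Nyúl approach, not the pipeline used in this study: it learns a standard scale from percentile landmarks and then maps each scan onto that scale piecewise-linearly. The landmark percentiles, the 0 to 100 target range, and all function names are assumptions made for illustration, and the voxel lists stand in for the brain-masked intensities of one sequence.

```python
from bisect import bisect_right

# Assumed landmark percentiles; the text does not state which were used.
LANDMARK_PCTS = [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99]


def percentile(sorted_vals, p):
    """Linearly interpolated percentile (p in 0..100) of a pre-sorted list."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def landmarks(voxels):
    """Intensity landmarks of one scan at the chosen percentiles."""
    ordered = sorted(voxels)
    return [percentile(ordered, p) for p in LANDMARK_PCTS]


def learn_standard_scale(scans, s_min=0.0, s_max=100.0):
    """Average the landmarks of all scans after mapping each scan's
    first and last landmark onto [s_min, s_max]."""
    sums = [0.0] * len(LANDMARK_PCTS)
    for voxels in scans:
        lm = landmarks(voxels)
        span = lm[-1] - lm[0]
        for i, value in enumerate(lm):
            sums[i] += s_min + (value - lm[0]) / span * (s_max - s_min)
    return [total / len(scans) for total in sums]


def normalize(voxels, standard):
    """Map a scan onto the standard scale, piecewise-linearly between landmarks.
    Values outside the outer landmarks are extrapolated from the end segments."""
    lm = landmarks(voxels)
    out = []
    for v in voxels:
        i = min(max(bisect_right(lm, v) - 1, 0), len(lm) - 2)
        x0, x1 = lm[i], lm[i + 1]
        y0, y1 = standard[i], standard[i + 1]
        out.append(y0 if x1 == x0 else y0 + (v - x0) / (x1 - x0) * (y1 - y0))
    return out


# Toy check: the same "tissue" acquired with a different gain and offset.
scan_a = [float(v) for v in range(200)]
scan_b = [3.0 * v + 50.0 for v in scan_a]
standard = learn_standard_scale([scan_a, scan_b])
norm_a = normalize(scan_a, standard)
norm_b = normalize(scan_b, standard)
```

In this toy case the two scans differ only by a linear gain and offset, so after normalization they land on the same values, which is the behavior one wants when pooling images from different scanners.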

Textural radiomic sequencing

We performed whole radiomic sequencing to extract features from the ceT1WI and FLAIR sequences, which can be seen in Fig. 1.

Fig. 1 Radiomics pipeline

Across all phenotypes and sequences, we extracted 80 first-order features in the form of an intensity-level histogram and 4,800 second-order features in the form of gray-level co-occurrence matrices (GLCM). The intensity-level histogram is a function showing the number of voxels with a specific intensity in the segmented volume. From each histogram, we extracted ten features, including minimum, maximum, mean, standard deviation, skew, kurtosis, and four percentiles, per phenotype per sequence [15]. As we extracted 10 first-order features for all 4 phenotypes for both FLAIR and ceT1WI, we had 80 total first-order features. A GLCM is a two-dimensional analysis of an image’s texture, evaluating the frequency of two pixels with gray intensity levels occurring at specified distances and angular relationships. In our pipeline, we evaluated adjacent pixels (distance = 1) in the four angular directions of a 2D plane (i.e., a single slice). We obtained 60 rotation-invariant features for each volume for every gray level and then discretized this into 5 gray levels to increase the signal-to-noise ratio. This allowed us to collect 300 features per phenotype. Considering we have 4 phenotypes (edema/invasion; enhancing tumor; necrosis; whole tumor volume) and two sequences (ceT1WI and FLAIR), we collected 2,400 GLCM features per patient. Further, we averaged this entire feature set by tumor volume to create another, volume-independent set of 2,400 features. Lastly, we collected approximately 600 radiomic features using similar techniques from the contralateral hemisphere normal-appearing white matter for normalization. This helped control for different imaging hardware and software used across MDACC and TCGA.
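To make the GLCM description concrete, the sketch below builds a co-occurrence matrix for a single 2D slice at distance 1 in the four angular directions and averages one texture feature (contrast) across directions. It is an illustration under stated assumptions, not the study's implementation: equal-width gray-level binning, averaging over directions as the route to rotation invariance, the choice of contrast as the example feature, and all function names are assumptions, and the tumor mask is ignored for brevity. The last lines simply reproduce the feature-count bookkeeping given in the text.

```python
# Offsets (row, col) for 0, 45, 90 and 135 degrees at distance 1.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]


def quantize(image, levels):
    """Map intensities to integer gray levels 0..levels-1 (equal-width bins)."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    width = (hi - lo) / levels or 1.0  # avoid a zero bin width on flat images
    return [[min(int((v - lo) / width), levels - 1) for v in row] for row in image]


def glcm(q, levels, offset):
    """Symmetric, normalized co-occurrence matrix of a quantized slice."""
    dr, dc = offset
    m = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(q), len(q[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = q[r][c], q[r2][c2]
                m[i][j] += 1.0
                m[j][i] += 1.0
    total = sum(map(sum, m))
    return [[v / total for v in row] for row in m]


def contrast(p):
    """GLCM contrast: expected squared gray-level difference of a pixel pair."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def direction_averaged_contrast(image, levels):
    """Average the feature over the four directions (assumed invariance scheme)."""
    q = quantize(image, levels)
    values = [contrast(glcm(q, levels, off)) for off in OFFSETS]
    return sum(values) / len(values)


# A 4x4 checkerboard: neighbors differ horizontally and vertically,
# but match along both diagonals.
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
checker_contrast = direction_averaged_contrast(checker, levels=2)
flat_contrast = direction_averaged_contrast([[7] * 4 for _ in range(4)], levels=5)

# Feature-count bookkeeping as stated in the text.
first_order_total = 10 * 4 * 2             # features x phenotypes x sequences
glcm_per_phenotype = 60 * 5                # features x gray levels
glcm_total = glcm_per_phenotype * 4 * 2    # x phenotypes x sequences
all_features = first_order_total + 2 * glcm_total  # raw + volume-averaged sets
```

The bookkeeping at the end recovers the totals quoted in the text: 80 first-order features, 2,400 GLCM features, and 4,880 features overall once the volume-averaged GLCM set is included.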

Radiomic textural analysis

We used radiomic features to classify patients across four domains: EGFR amplification, MGMT promoter methylation, molecular subgroup, and OS. To do this, we applied the Maximum Relevance Minimum Redundancy technique, which allows for efficient selection of the most relevant and non-redundant/uncorrelated features [31]. We performed this selection using the total of 4,880 radiomic features available for each patient to identify the 100 most relevant features. We used the F-statistic to measure feature relevance and the Pearson correlation coefficient to measure feature redundancy. With our selected features, we built radiomic models using a support vector machine (SVM) classifier, a machine-learning technique known to reduce the potential risk of over-fitting or false discovery [32].
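The selection step can be illustrated with a small greedy routine that scores relevance with the F-statistic and redundancy with the absolute Pearson correlation. This is a sketch only: the study used the mRMRe package in R, and the quotient form of the score (relevance divided by mean redundancy), the function names, and the toy features below are assumptions made for illustration.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within if within > 0 else float("inf")


def pearson(a, b):
    """Pearson correlation coefficient (0.0 if either input is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def mrmr(features, labels, k):
    """Greedy selection: features is a dict of name -> list of values.
    Each step picks the feature with the best relevance / redundancy ratio
    (assumed quotient form; other mRMR variants use a difference)."""
    relevance = {name: f_statistic(vals, labels) for name, vals in features.items()}
    selected, remaining = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = mean([abs(pearson(features[name], features[s])) for s in selected])
        return relevance[name] / (redundancy + 1e-9)

    while remaining and len(selected) < k:
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "texture_b" is a rescaled copy of "texture_a"; "histogram_c" is
# less relevant on its own but carries largely independent information.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "texture_a": [1, 2, 3, 4, 3, 4, 5, 6],
    "texture_b": [3, 5, 7, 9, 7, 9, 11, 13],
    "histogram_c": [4, 3, 2, 1, 5.5, 4.5, 3.5, 2.5],
}
chosen = mrmr(features, labels, k=2)
```

With these toy inputs the routine keeps one of the two duplicated texture features and then prefers the weaker but non-redundant histogram feature over the second copy, which is the behavior the technique is designed to produce.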

For each domain analyzed (i.e., EGFR amplification, MGMT methylation, molecular subgroup, OS), we evaluated our model with two validation approaches. First, we performed leave-one-out cross-validation (LOOCV) on each task and reported our results. Second, we trained our model with data from one dataset and tested the model on the other dataset. For each domain tested, we used the larger dataset for training and the smaller dataset for testing. For example, if more patients from TCGA had EGFR amplification available compared to MDACC, we would build our model with TCGA and use MDACC for model testing.
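The two validation schemes can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier is used as a plainly named stand-in for the SVM; it is not the classifier used in the study, and the function names and toy data are hypothetical.

```python
def fit_nearest_centroid(X, y):
    """Stand-in classifier (not an SVM): store the mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_nearest_centroid(model, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(sorted(model), key=lambda label: sq_dist(model[label]))


def loocv(X, y, fit, predict):
    """Leave-one-out: refit on all patients but one, predict the held-out one."""
    predictions = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        predictions.append(predict(model, X[i]))
    return predictions


def external_validation(X_train, y_train, X_test, fit, predict):
    """Train once on the larger cohort, then predict every patient in the other."""
    model = fit(X_train, y_train)
    return [predict(model, row) for row in X_test]


# Toy cohort with a single radiomic feature per patient.
X = [[0.0], [0.2], [0.1], [1.0], [1.2], [0.9]]
y = [0, 0, 0, 1, 1, 1]
loocv_predictions = loocv(X, y, fit_nearest_centroid, predict_nearest_centroid)
external_predictions = external_validation(
    X, y, [[0.05], [1.1]], fit_nearest_centroid, predict_nearest_centroid
)
```

Because `fit` is called inside the loop, anything placed in it (scaling, feature selection, classifier training) only ever sees the training patients of that fold.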

We evaluated the performance of our model by computing receiver operating characteristic curves and reporting area under the curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). We also performed Kaplan–Meier and multivariable Cox regression analyses to evaluate survival. Log-rank tests and hazard ratios are reported. We used R (version 3.3.1, R Foundation for Statistical Computing, Vienna, Austria) for all the statistical analyses, the package mRMRe (version 2.0.5) for the feature selection task, and the machine learning package mlr (version 2.8) to build the SVM classifier. The survival package (version 2.39–5) was used for Kaplan–Meier and multivariable Cox regression analyses. ROC analysis was performed using the pROC package (version 1.8).
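For readers less familiar with these quantities, the sketch below computes an AUC and a Kaplan–Meier curve from first principles. It is illustrative only and does not reproduce the pROC or survival packages used in the study; the function names and toy inputs are hypothetical. The AUC is computed in its pairwise (Mann–Whitney) form, which equals the area under the empirical ROC curve.

```python
def roc_auc(scores, labels):
    """Probability that a random positive scores higher than a random negative
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability) steps.
    events[i] is 1 for an observed death and 0 for a censored patient."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy inputs: model scores for six patients, and follow-up (months) for five.
auc = roc_auc([0.9, 0.8, 0.35, 0.6, 0.4, 0.1], [1, 1, 1, 0, 0, 0])
km_curve = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
```

In the survival example, the patient censored at 8 months leaves the risk set without lowering the curve, which is what distinguishes the Kaplan–Meier estimate from a simple fraction of patients alive.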

Results

Clinical characteristics of study cohort

Overall, we identified 235 pathologically confirmed GBM patients: 88 from TCGA and 147 from MDACC. EGFR amplification status was available for 84 and 37 patients in the TCGA and MDACC cohorts, respectively. MGMT promoter methylation was available for 86 and 28 patients in the TCGA and MDACC cohorts, respectively. TCGA was used for model training and MDACC for model validation for EGFR amplification and MGMT promoter methylation, while the opposite was used for OS. Molecular subtype data were available for 66 patients (24 mesenchymal, 13 classical, and 29 proneural) from TCGA via cBioPortal [33]. The number of patients, baseline demographics, and clinical characteristics of the study cohorts are summarized in Tables 1, 2, and 3. Generally, patients with OS > 12 months were younger and had higher KPS.

Table 1 TCGA patient demographic table
Table 2 TCGA patient demographic table: comparison of the demographic and clinical characteristics of the patients according to GBM subgroup
Table 3 MDACC patient demographic table: comparison of the demographic and clinical characteristics of the MDACC patients according to OS, EGFR and MGMT

Performance of radiomics models for genetic alterations

We performed LOOCV and external validation for predicting EGFR amplification and MGMT promoter methylation status, and LOOCV only for molecular subgroup analysis, as these data were available only for the TCGA dataset (Fig. 2). Descriptions of these features can be found in Supplementary Table 3. LOOCV across the TCGA and MDACC cohorts had high AUC for identifying EGFR amplification (n = 121; AUC 83.5%, sensitivity 70.3%, specificity 86.0%, PPV 72.1%, NPV 84.9%; Fig. 2A and Tables 4, 5). When validating the TCGA model on the hold-out MDACC cohort, the model performed similarly well (n = 37; AUC 83.6%, sensitivity 85.7%, specificity 75.0%, PPV 81.8%, NPV 80.0%).
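As a worked illustration of how these four metrics relate to a confusion matrix, the sketch below uses a set of counts that is consistent with the percentages reported for the EGFR external validation (n = 37). The counts themselves are an assumption: they were back-calculated from the reported percentages for illustration and are not given in the text.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard confusion-matrix metrics, returned as fractions."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all positives
        "specificity": tn / (tn + fp),  # true negatives among all negatives
        "ppv": tp / (tp + fp),          # correct among positive predictions
        "npv": tn / (tn + fn),          # correct among negative predictions
    }


# Hypothetical counts (not reported in the text) chosen to match the
# percentages stated for the EGFR external validation cohort.
counts = {"tp": 18, "fn": 3, "tn": 12, "fp": 4}
metrics = diagnostic_metrics(**counts)
as_percent = {name: round(100 * value, 1) for name, value in metrics.items()}
n_patients = sum(counts.values())
```

Under these assumed counts, the four metrics come out at 85.7%, 75.0%, 81.8%, and 80.0%, matching the values reported above for the 37-patient cohort.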

Fig. 2 Model performance for predicting EGFR amplification (A), MGMT methylation (B), and molecular subgroup (C). For EGFR Amplification and MGMT Methylation prediction, we reported performance in two ways. First, we performed LOOCV on both cohorts. Second, we developed a prediction model with the cohort with the largest number of patients and validated this model with the other cohort. For example, more patients had EGFR amplification recorded in TCGA study (84) than the MDACC (37) cohort. As a result, we performed LOOCV on both cohorts to describe performance (Left). We then built a model using TCGA patients and validated it on the MDACC cohort (Right). D shows the three-way confusion matrix for molecular subgroups

Table 4 Model prediction performance based on development validation cohorts
Table 5 Multivariate cox models with radiomic sequencing, age, and KPS

We used a similar set of radiomic features for the MGMT model; these can be found in Supplementary Table 4. LOOCV of MGMT promoter methylation performed well (n = 114; AUC 85.9%, sensitivity 67.9%, specificity 91.8%, PPV 76.7%, NPV 87.9%; Fig. 2B). Our validation model trained on the TCGA dataset predicted MGMT promoter methylation status in the MDACC cohort (n = 28; AUC 91.8%, sensitivity 84.6%, specificity 93.3%, PPV 91.7%, NPV 87.5%).

Lastly, we determined the molecular subgroup in the TCGA set with a three-way classifier and LOOCV predictive modeling for each subgroup against the other two (Fig. 2C). Our model exhibited high performance for predicting the mesenchymal group (n = 24; green line; AUC 95.7%, specificity 85.7%, sensitivity 95.8%), proneural group (n = 29; AUC 93.1%, specificity 89.2%, sensitivity 93.1%), and classical group (n = 13; red line; AUC 92.5%, specificity 94.3%, sensitivity 84.6%). Figure 2D shows a three-way confusion matrix demonstrating an overall accuracy of 84.9%. Supplementary Table 5 lists the radiomic features used in the model identifying subgroups.

Radiomic features for predicting OS

OS, age, and KPS were available for 81 patients in TCGA and 137 in MDACC. The median survival time of the MDACC cohort (22 months) was substantially longer than that of the TCGA cohort (14 months; p < 0.001; Supplementary Fig. 1). For both the TCGA and MDACC datasets, we identified the top 100 radiomic features differing between patients with an OS > 12 months and OS < 12 months (Supplementary Table 6). LOOCV in the TCGA cohort (N = 81; AUC 93.3%, sensitivity 97.6%, specificity 76.9%, PPV 82.0%, NPV 96.8%; Fig. 3A/B) and MDACC cohort (N = 137; AUC 91.6%, sensitivity 90.7%, specificity 82.8%, PPV 95.2%, NPV 70.6%; Fig. 3C/D) performed well. Moreover, our radiomic features significantly predicted survival in TCGA (p < 0.001) and MDACC (p < 0.001) cohorts in a multivariate model with age (TCGA p = 0.05, MDACC p < 0.01) and KPS (TCGA p < 0.01, MDACC p = 0.92). Lastly, we built an OS prediction model from MDACC patients and validated it on the TCGA dataset, demonstrating the robustness of our approach (N = 81, AUC 70.5%, sensitivity 66.7%, specificity 69.2%, PPV 70.0%, NPV 65.9%; Fig. 3E; cognate Kaplan–Meier plot p = 0.02; Fig. 3F).

Fig. 3 Model performance for predicting OS in the TCGA dataset (A/B) and MDACC (C/D) using LOOCV for both datasets. We validated our approach through building a survival model with MDACC patients (median OS 22 months) and validating on the TCGA dataset (median OS 14 months; E/F)

Radiogenomic probability map and clinical report cards for predicting personalized patient profiles

To demonstrate clinical applicability and benefit for individual patients, we developed personalized probability maps for key molecular events and survival (Fig. 4). As depicted, this radiomics analysis paradigm uses the analysis pipeline performed above, with routine patient MRI scans as input, to concisely predict EGFR amplification, MGMT promoter methylation, GBM subtype, and OS pre-operatively. Figure 4 shows the report card for 3 sample patients, as well as whether the predictions were correct.

Fig. 4 Radiogenomic probability maps and clinical report cards for three representative patients. This figure presents a summary of the radiomics-based prediction of the genomic hallmarks of GBM and survival for three representative patients. The top row shows the patients’ segmented brain MRIs. The center row shows the radiomic probability map (as surrogate for radiomic sequencing output) for key genomic markers and survival. The bottom row shows the clinical report card summarizing the predicted probabilities for EGFR amplification, MGMT methylation, GBM subtype, and OS. A check mark or “X” indicates whether the respective prediction was correct or incorrect, respectively

Discussion

In this study, we developed a robust radiomics pipeline that predicted OS and common genetic alterations in a multi-institutional dataset using pre-operative MRI scans. Our models successfully identified patients with MGMT methylation (AUC > 0.86), EGFR amplification (AUC > 0.83), and GBM subtype (AUC > 0.93). MGMT promoter methylation is an important prognostic epigenetic alteration that confers sensitivity to DNA-alkylating agents (e.g., temozolomide), the standard-of-care chemotherapy for GBM [4]. EGFR amplification is a common genetic alteration associated with more aggressive tumor behavior and is a potentially targetable genetic alteration [34]. While previous groups have identified MGMT methylation [35], other genetic alterations, including EGFR amplification and GBM subtype, are less well validated. While Le et al. previously reported predictive accuracies of 70.9%, 73.3%, and 88.4% for classical, mesenchymal, and proneural subtypes, respectively, our models improved upon this with accuracies of 92.4%, 95.7%, and 93.1% for classical, mesenchymal, and proneural subtypes [36]. Our ability to identify several alterations highlights the ability of an MRI-based radiomics model to pre-operatively risk-stratify patients.

One of the strengths of our modeling approach is its stability and generalizability. We took steps, including normalizing each MRI with normal-appearing white matter from the contralateral hemisphere and creating volume-independent radiomic features, that allowed our approach to retain performance across multiple institutions and MRI vendors. Our model accuracy improved for EGFR amplification and MGMT methylation during the external validation compared to LOOCV. Similarly, for OS, although the MDACC cohort had a significantly different OS than the TCGA dataset, the external validation retained robust performance (AUC 70.5%). In an ideal clinical setting, radiomic survival models could be tuned to the survival patterns of the hospital, therefore mimicking the LOOCV approach where the AUC was > 0.91. Generalizable approaches, validated across multiple institutions, are key to the successful deployment of radiomics in a clinical environment.

Another strength of our approach is our high model accuracy. While other groups have developed radiomic models in the past, several are limited by lower performance [17, 18, 37, 38] or a lack of external validation [35, 39]. Our model performance, in excess of an AUC of 0.83 for most tests, likely reflects our pipeline, which generates thousands of first- and second-order radiomic features, and our robust statistical approaches to identifying salient features.

We also introduced the concept of a patient report card that highlights the likelihood of certain genetic alterations and chances for higher OS. This report card can be developed pre-operatively using non-invasive tools like MRI and common clinical information. This type of tool could be deployed in clinical settings to help patients better understand their disease course and help pre-operatively balance risks versus rewards for surgery. A patient report card could also help identify genetic targets for future neo-adjuvant chemotherapy or better balance patients in clinical trials. Further work is needed to deploy this tool in real-world clinical settings.

While our approach has significant potential to enhance GBM clinical care, several limitations exist. Most importantly, many machine learning programs suffer significant performance declines when deployed in real-world settings [40]. The deployment of any radiomics model in clinical practice should be evaluated rigorously. Additionally, while our methods for feature extraction are semi-automated, segmenting MRIs still requires manual effort, and results may differ between the providers performing segmentation.

Conclusion

In conclusion, we developed a robust radiomics pipeline that pre-operatively predicted EGFR amplification, MGMT methylation, GBM subtype, and OS. Our models retain robust performance when validated in external datasets, highlighting our generalizable approach. Future studies should evaluate deploying radiomics models in clinical practice.