Introduction

Gliomas are the most common primary intracranial tumor, accounting for 81% of all malignant brain tumors (Ostrom et al. 2014). According to the World Health Organization (WHO) criteria, gliomas are assigned grades I–IV indicating different degrees of malignancy (Weller et al. 2015). Patients with gliomas of different grades are treated with significantly different surgical plans, radiotherapy, and adjuvant chemotherapy strategies (Weller et al. 2014, 2017). Low-grade gliomas (LGG, grade I–II) are typically associated with a longer life expectancy than high-grade gliomas (HGG, grade III–IV) (Weller et al. 2015). Grade I gliomas are characterized by slow growth and a high likelihood of cure by surgical resection alone (Weller et al. 2015). Glioblastomas (GBM, grade IV) are the most aggressive tumor type, with a median survival time of about 12–15 months (Louis et al. 2016). In contrast to GBMs, the other grade gliomas (OGG, grade II–III) show more favorable outcomes and share similar histopathologic and genomic characteristics (Louis et al. 2016). According to the WHO tumor classification, the mutation status of isocitrate dehydrogenase (IDH) 1 codon 132 or IDH2 codon 172 plays a major role in diagnosing and treating gliomas (Louis et al. 2016; Weller et al. 2017). The presence of IDH mutation (IDH(+)) distinguishes glioma entities with distinct biology. Glioma patients carrying IDH(+) usually have a significantly more favorable treatment response and outcome than patients with wild-type IDH (IDH(−)) (Reifenberger et al. 2017; Weller et al. 2015). Loss of nuclear alpha thalassemia/mental retardation syndrome X-linked gene (ATRX) expression has never been found to be accompanied by 1p/19q codeletion (Ikemura et al. 2016) and is characteristic of diffuse astrocytomas rather than oligodendrogliomas (Louis et al. 2016). ATRX loss (ATRX(−)) is also associated with poor outcomes in LGG patients (Ogishima et al. 2017).
Therefore, it is critical to distinguish LGG from HGG, GBM from OGG, and to identify IDH and ATRX subtypes. Currently, histopathology procedures, immunohistochemistry, or sequencing following biopsy or surgical resection are the main methods used for glioma grading and molecular subtyping; however, all of these methods are invasive (Ferris et al. 2017).

Preoperative noninvasive diagnosis of gliomas is based mainly on conventional magnetic resonance imaging (MRI), such as T2 fluid-attenuated inversion recovery (T2 FLAIR) and contrast-enhanced T1-weighted images (T1WI + C), which has limited value in grading and genetic classification (Ly et al. 2020). Pathologically, HGG exhibit a higher rate of vascular proliferation, microhemorrhages, and small vessels than LGG (Ferris et al. 2017). Intratumoral calcification tends to be found in 1p/19q co-deleted oligodendrogliomas, which are associated with improved prognosis and responsiveness to therapy (Saito et al. 2016). Different forms of iron in blood products and calcification inside gliomas result in susceptibility variations, all of which can be detected by quantitative susceptibility mapping (QSM) (Wang et al. 2017a, b).

QSM has become a sensitive and reliable quantitative technique to determine the bulk magnetic susceptibility distribution caused by iron load of tissues (Langkammer et al. 2012). By identifying the magnetic field produced by tissue susceptibility and solving the field-to-magnetization (tissue susceptibility) inverse problem (Kee et al. 2017), QSM deconvolves the blooming artifacts in gradient-echo (GRE) phase data and shows much better contrast than R2* or T2 methods (Haacke et al. 2015). So far, QSM has been widely used in the quantitative study of brain iron content (Li et al. 2019). Previous studies have applied QSM to distinguish HGG with hemorrhage from less aggressive brain tumors with or without calcification (Bandt et al. 2019), and to differentiate between blood deposits and calcification in GBMs (Deistung et al. 2013).

Recently, machine learning (ML) has been developed to capture complex patterns in imaging data that are beyond human perception and to provide quantitative evaluation of radiographic features for data-driven prediction tasks (Hosny et al. 2018; Lotan et al. 2019). A large amount of ML-assisted research has been applied to determine histological glioma grade, molecular profiles, and prognosis (Lotan et al. 2019). With the huge success of convolutional neural networks (CNN), especially ResNet and its variants, in medical image classification tasks (Cheng et al. 2022), deep learning (DL) can build a pipeline for feature extraction and image classification while reducing the subjective bias of manual feature extraction. In particular, inception and pyramid modules benefit multi-scale feature extraction, which is important in medical image classification because the region of interest (ROI) may be small and discrete. However, a model trained on a dataset with a small number of samples may suffer from overfitting. Inspired by the semi-supervised learning strategy of training with a consistency loss, we proposed an inception CNN encoder with a consistency loss computed from the outputs of adjacent slices for glioma diagnosis. The purpose of this study was to explore the value of DL-assisted QSM in the prediction of glioma grades, IDH1(+), and ATRX(−).

Materials and Methods

Study Population

This study was approved by the Ethics Review Board of our institution, and individual consent for this retrospective analysis was waived. A total of 51 patients with clinically suspected gliomas underwent a unified preoperative MR examination protocol. These patients met three main criteria: (1) clinically newly diagnosed primary brain gliomas without any pharmacotherapy or radiotherapy; (2) available histopathological diagnosis and molecular genetic characteristics including IDH and ATRX; (3) available preoperative MR images, including QSM, T2 FLAIR, and T1WI + C sequences. Nine patients were excluded for the following reasons: (1) uncommon pathological diagnosis of gliosarcoma (n = 1) and pleomorphic xanthoastrocytoma (n = 1); (2) image artifacts owing to patient movement (n = 3); and (3) error in QSM image processing (n = 4). Finally, 42 patients (18 female and 24 male, mean age: 47 years, age range: 26–75 years) were enrolled in this study (Fig. 1).

Fig. 1 Flowchart of the study population. IDH1(+): IDH1 mutation; IDH1(−): IDH1 wildtype; ATRX(−): ATRX expression loss; ATRX(+): ATRX retention

In the present study, three stratified detection tasks were designed: (1) the detection of glioma grades (LGG and HGG, OGG and GBM); (2) the detection of gliomas with IDH1(+) or IDH1(−); (3) the detection of IDH1 mutated glioma with ATRX(−) or ATRX retention (ATRX(+)).

Pathology Data Collection

The histopathologic grading and molecular subtyping data of gliomas were collected from an electronic database in the Neuropathology Department of our institution. Each resection specimen was sectioned and stained with hematoxylin and eosin. IDH1 and ATRX molecular status were tested by immunohistochemical staining. According to the latest WHO classification, two experienced neuropathologists consistently diagnosed the glioma grades and performed molecular classification.

MRI Acquisition

MR images were acquired on a 3.0T MRI system (Discovery 750; GE Healthcare, Milwaukee, WI) with an eight-channel phased-array head coil (GE Medical Systems). The QSM was generated from a three-dimensional multi-echo GRE sequence. Each patient received the unified preoperative MR scan in the following order: axial T1WI, T2-weighted image (T2WI), T2 FLAIR, multi-echo GRE, and T1WI + C sequences. Specific parameters were as follows: T1WI with FLAIR technique: repetition time (TR)/echo time (TE) = 3195/24 ms, field of view (FOV) = 240 × 240 mm, matrix size = 256 × 256, slice thickness = 4 mm, number of slices = 28; T2WI: TR/TE = 9185/108 ms, FOV = 240 × 240 mm, matrix size = 256 × 256, slice thickness = 4 mm, number of slices = 28; T2 FLAIR: TR/TE = 9491/140 ms, inversion time = 2200 ms, FOV = 240 × 240 mm, matrix size = 256 × 256, slice thickness = 4 mm, number of slices = 28; and multi-echo GRE: TR = 41.6 ms, number of echoes = 16, first TE = 3.2 ms, TE spacing = 2.4 ms, bandwidth = 62.50 kHz, flip angle = 12°, FOV = 256 × 256 mm, matrix size = 256 × 256, slice thickness = 1 mm, voxel size = 1 × 1 × 1 mm³, number of slices = 140, acceleration factor = 2, acquisition time = 9 min. Array spatial sensitivity encoding technique was employed to accelerate the multi-echo GRE. Contrast-enhanced images were obtained immediately after administering a standard dose (0.1 mmol/kg body weight) of gadopentetate dimeglumine (Beilu, Beijing, China) at approximately 3–4 mL/s via the dorsal hand or elbow vein.

Image Reconstruction and Tumor Segmentation

QSM reconstructions were performed using susceptibility tensor imaging (STI) Suite software (Duke University) with reference to the previous studies (Li et al. 2011, 2015). First, the multi-echo phase images for each channel of the coil were collected and then averaged after subtracting the receiver phase of each channel. The phase was unwrapped with a Laplacian method. The unwrapped phase images were normalized by the corresponding echo times and averaged to determine the frequency shift. Second, the background phase was removed by the sophisticated harmonic artifact reduction for phase data method, and the filter radius was set as eight (Schweser et al. 2011). Third, the susceptibility map of the brain tissue was obtained from the frequency map by an improved least-squares method, and the regulatory threshold of Laplace filtering was set to 0.04 (Li et al. 2011, 2015).
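The echo-time normalization and averaging step described above can be illustrated for a single voxel. This is a minimal sketch, not the STI Suite implementation: it assumes the unwrapped phase grows linearly with echo time (phase = 2π · f · TE), takes the echo times from the stated acquisition parameters, and uses hypothetical function names and a synthetic 5 Hz voxel.

```python
import math

# Echo times from the multi-echo GRE parameters stated in the text:
# first TE = 3.2 ms, TE spacing = 2.4 ms, 16 echoes.
FIRST_TE_MS = 3.2
TE_SPACING_MS = 2.4
NUM_ECHOES = 16


def echo_times_s():
    """Return the 16 echo times in seconds."""
    return [(FIRST_TE_MS + i * TE_SPACING_MS) * 1e-3 for i in range(NUM_ECHOES)]


def frequency_shift_hz(unwrapped_phase_rad, tes_s):
    """Normalize each echo's unwrapped phase by its echo time, then average.

    Assumption (not from the source): phase_n = 2*pi*f*TE_n, so each echo
    yields an estimate f_n = phase_n / (2*pi*TE_n) in Hz.
    """
    if len(unwrapped_phase_rad) != len(tes_s):
        raise ValueError("one phase value is required per echo")
    per_echo = [phi / (2.0 * math.pi * te)
                for phi, te in zip(unwrapped_phase_rad, tes_s)]
    return sum(per_echo) / len(per_echo)


tes = echo_times_s()
# Synthetic voxel with a hypothetical 5 Hz frequency offset.
phase = [2.0 * math.pi * 5.0 * te for te in tes]
print(round(frequency_shift_hz(phase, tes), 6))
```

In a real pipeline the same operation would run voxel-wise on the unwrapped phase volumes before background field removal.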

Tumor segmentation was performed on axial T2 FLAIR, T1WI + C, and QSM images, respectively, using the Insight Toolkit-SNAP program (University of Pennsylvania, www.itksnap.org). With reference to T1WI, T2WI, T2 FLAIR, and T1WI + C, one neuroradiologist (WTR), who was blinded to the histopathologic and molecular information, delineated the tumor ROI under the guidance of a neuroradiologist with 20 years of experience (ZWY), using the same criteria: at the image section with the maximum diameter of solid tumor in each sequence, an arbitrarily shaped ROI was delineated around the tumor area while avoiding peritumoral edema as much as possible (Rui et al. 2018).

Deep Learning

To avoid a potential data gap among the training, validation, and test datasets, intensity normalization was conducted for each individual modality. Then, MRI slices were center cropped to minimize the impact of background. A binary tumor mask was available for the annotated slice and served as a weak annotation to help guide the classification task; it was concatenated with the image in the channel dimension. As shown in Fig. 2, there are two paths in our proposed inception CNN: a labeled path for annotated MRI slices and a label-free path for unlabeled data. To automatically extract informative features from the MRI slices, an inception module was introduced to capture potential biomarkers for glioma classification. For each path, three inception layers were employed to enhance the representation ability, followed by a global average pooling layer and a linear layer. For multi-modality data, image slices were concatenated in the channel dimension (e.g., T1WI + C, T2 FLAIR, and QSM for three modalities). To alleviate the influence of dataset split bias, fivefold cross-validation was used, and the ratio of training, validation, and test data was 4:1:1. The classification algorithms aim to classify specific models via targeted metrics. For the grading task, we denoted the true labels with the numbers 1, 2, and 3, corresponding to grades II, III, and IV. For the prediction of IDH1 mutation, the labels were divided into two classes: + and −. For the prediction of ATRX loss, − and + were the binary classification labels. The dataset split of each fold was the same for the proposed inception CNN and the algorithms used for comparison. The deep learning methods, including the inception CNN and the standard CNN, were trained through loss backpropagation. For the corresponding details, refer to Supplementary Material 1.
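The preprocessing described above (per-modality intensity normalization, center cropping, and channel-wise concatenation of the modalities with the binary mask) might look roughly like the following NumPy sketch. Several details are assumptions rather than the authors' settings: z-score normalization, a 128 × 128 crop, slices that already share one 256 × 256 grid, and the function names `zscore`, `center_crop`, and `build_input`.

```python
import numpy as np


def zscore(img):
    """Per-modality intensity normalization (z-score is an assumption)."""
    img = img.astype(np.float32)
    std = img.std()
    return (img - img.mean()) / (std if std > 0 else 1.0)


def center_crop(img, size):
    """Crop a 2D slice to size x size pixels around its center."""
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]


def build_input(slices, mask, size=128):
    """Stack normalized, cropped modality slices plus the mask as channels."""
    channels = [center_crop(zscore(s), size) for s in slices]
    channels.append(center_crop(mask.astype(np.float32), size))
    return np.stack(channels, axis=0)


# Synthetic stand-ins for co-registered T1WI + C, T2 FLAIR, and QSM slices.
rng = np.random.default_rng(0)
t1c, flair, qsm = (rng.normal(size=(256, 256)) for _ in range(3))
mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:140, 110:150] = 1  # hypothetical tumor ROI

x = build_input([t1c, flair, qsm], mask)
print(x.shape)  # (4, 128, 128): three modalities plus the mask channel
```

With a single modality the same function yields a two-channel input (image plus mask), so the first convolution of a network consuming it would need a matching number of input channels.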

Fig. 2 Pipeline of the proposed inception convolutional neural network. For the ensemble of multi-modality, image slices of the same subject are concatenated in the channel dimension. Only one slice is annotated with a tumor mask, which is fed to the inception CNN for supervision, and the adjacent slices are fed to the shared-weight encoder to achieve a similar output
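The idea of supervising the annotated slice while encouraging similar outputs for its adjacent slices can be sketched as a combined loss. The exact formulation is given only in the supplementary material, so everything below is an assumption made for illustration: cross-entropy for the labeled slice, a mean squared difference between softmax outputs as the consistency term, a weight of 1.0, and all function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Supervised loss for the annotated slice."""
    return -math.log(softmax(logits)[label])


def consistency(logits_ref, logits_adj):
    """Mean squared difference between two class-probability vectors."""
    p, q = softmax(logits_ref), softmax(logits_adj)
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def total_loss(labeled_logits, label, adjacent_logits, weight=1.0):
    """Cross-entropy on the labeled slice plus averaged consistency terms."""
    if not adjacent_logits:
        raise ValueError("at least one adjacent slice is required")
    cons = sum(consistency(labeled_logits, adj) for adj in adjacent_logits)
    cons /= len(adjacent_logits)
    return cross_entropy(labeled_logits, label) + weight * cons


# Hypothetical encoder outputs for one annotated slice and two neighbors.
labeled = [2.0, 0.5]
neighbors = [[1.8, 0.6], [0.2, 1.1]]
print(total_loss(labeled, 0, neighbors))
```

When the neighbors produce the same output as the annotated slice, the consistency term vanishes and the loss reduces to plain cross-entropy.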

The model performance (recorded as mean ± standard deviation) was evaluated by computing the total prediction accuracy, sensitivity/recall, specificity, positive predictive value (PPV)/precision, negative predictive value (NPV), and F1 score (defined as the harmonic mean of sensitivity and PPV) from the confusion matrix. A receiver operating characteristic (ROC) curve was also drawn, and the area under the curve (AUC) was calculated to assess the discriminative ability of the model in the test dataset. To illustrate the advantage of QSM in glioma classification, t-distributed stochastic neighbor embedding (t-SNE) and Shapley value analysis were employed to visualize the feature distribution of the different modalities and the contribution of each modality to the model output. The standard t-SNE pipeline was taken from the manifold package of scikit-learn, and the code for Shapley value analysis followed the official demo. To illustrate the effectiveness of the proposed inception CNN, a support vector machine (SVM), a standard CNN (three-layer), and a generalized linear model (GLM) were used for comparison in the three-modality fusion classification tasks.
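The confusion-matrix metrics listed above follow directly from the four cell counts. The sketch below uses hypothetical counts and hypothetical helper names; the use of the sample standard deviation for the fold-wise "mean ± standard deviation" is also an assumption, since the source does not state which estimator was used.

```python
import statistics


def binary_metrics(tp, fp, tn, fn):
    """Compute the reported metrics from binary confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    sens = ratio(tp, tp + fn)   # sensitivity / recall
    spec = ratio(tn, tn + fp)   # specificity
    ppv = ratio(tp, tp + fp)    # positive predictive value / precision
    npv = ratio(tn, tn + fn)    # negative predictive value
    acc = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * sens * ppv, sens + ppv)  # harmonic mean of sens and PPV
    return {"accuracy": acc, "sensitivity": sens, "specificity": spec,
            "ppv": ppv, "npv": npv, "f1": f1}


def mean_std(values):
    """Summarize per-fold values as (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical counts for one test fold.
m = binary_metrics(tp=8, fp=1, tn=6, fn=2)
print({k: round(v, 3) for k, v in m.items()})

# Hypothetical per-fold accuracies from a fivefold cross-validation.
print(mean_std([0.80, 0.90, 0.85, 0.90, 0.80]))
```

Because each metric is averaged over folds, the mean F1 score generally differs from the harmonic mean of the mean sensitivity and mean PPV.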

Results

Independent Efficiency of T2 FLAIR, T1WI + C, and QSM by DL in Glioma Tasks

The diagnostic performance of each MRI modality by DL for predicting glioma grades and the molecular subtype is listed in Table 1. For the grading task, QSM modality showed higher diagnostic accuracy of 0.80 and F1 score of 0.75 in differentiating OGG from GBM than T2 FLAIR (0.69, 0.60) and T1WI + C (0.74, 0.70); however, T1WI + C modality (0.74, 0.72) performed better in distinguishing LGG from HGG than QSM (0.69, 0.64) and T2 FLAIR (0.69, 0.61). For the IDH1 task, QSM modality (accuracy: 0.77, F1 score: 0.70) was superior to T2 FLAIR (0.57, 0.53) and T1WI + C (0.57, 0.52) in predicting IDH1(+). For the ATRX task, QSM modality (accuracy: 0.60, F1 score: 0.52) showed a better performance than T2 FLAIR (0.54, 0.46) and T1WI + C (0.46, 0.42) in diagnosing ATRX(−). The figures of exemplary gliomas of different grades and phenotypes can be found in the Supplementary Figs. 1–3.

Table 1 Comparisons of the performance for prediction of glioma grades and molecular subtypes on different MR modalities by fivefold cross-validation of deep learning

Importance of Different MRI Modalities and Clinical Features in Glioma Classification

Feature importance explanations showed that the QSM modality feature was the most important variable for the classification of OGG/HGG, IDH1(+) or IDH1(−), and ATRX(−) or ATRX(+) (Fig. 3). The spread of the Shapley values reflects the corresponding impact on the model output in the four classification tasks. The t-SNE plots showed that QSM modality features extracted by the inception network could partially separate GBM patients from OGG patients and distinguish IDH1(+) from IDH1(−), whereas T1WI + C and T2 FLAIR modality features could not (Fig. 4).

Fig. 3 Shapley values for analyzing importance scores of image features extracted by inception network, when the extracted features are combined with the non-imaging features (sex, age) for glioma classification (a for OGG/GBM, b for LGG/HGG, c for IDH1 (±), and d for ATRX (±))

Fig. 4 t-distributed stochastic neighbor embedding (t-SNE) visualization of modality (T1WI, T2 FLAIR, and QSM) feature distribution of four glioma classification tasks (a for OGG/GBM, b for LGG/HGG, c for IDH1 (±) and d for ATRX (±))

Added Value of QSM in Contrast to Conventional MRI

Performance metrics of multi-modal MRI in glioma tasks by DL are presented in Table 2. For the OGG and GBM classification, T1WI + C plus T2 FLAIR showed a diagnostic accuracy of 0.80 and F1 score of 0.78, and improved diagnostic efficiency could be achieved by adding the QSM modality (accuracy: 0.89, F1 score: 0.87, sensitivity or recall: 0.90, specificity: 0.87, PPV or precision: 0.91, NPV: 0.82). Similarly, better diagnostic performance (accuracy: 0.86, F1 score: 0.81) was acquired with QSM modality added than only routine T2 FLAIR plus T1WI + C (0.77, 0.76) in discriminating LGG from HGG. For the IDH1 task, diagnostic efficacy of T1WI + C plus QSM (accuracy: 0.80, F1 score: 0.74) or T2 FLAIR plus QSM (0.80, 0.75) was superior to T1WI + C plus T2 FLAIR modalities (0.69, 0.60). By combining the three modalities, satisfactory diagnostic efficiency was obtained (accuracy: 0.89, F1 score: 0.85, sensitivity: 0.81, specificity: 1.00, PPV: 1.00, NPV: 0.80) in determining IDH1(+). For the ATRX task, diagnostic performance of T1WI + C plus T2 FLAIR modalities was unsatisfactory (accuracy: 0.51, F1 score: 0.48), and the efficacy was slightly better when using T1WI + C plus QSM (0.63, 0.59) or T2 FLAIR plus QSM modalities (0.66, 0.60). When the three modalities participated together, moderate diagnostic ability (accuracy: 0.71, F1 score: 0.67, sensitivity: 0.70, specificity: 0.73, PPV: 0.81, NPV: 0.60) could be achieved in predicting ATRX(−). ROC curves in diagnosing gliomas based on three MRI modalities by DL are shown in Fig. 5. The mean AUC was 0.83 in differentiating LGG from HGG, 0.91 in discriminating OGG from GBM, and 0.88 and 0.78 in predicting IDH1(+) and ATRX(−), respectively. The comparisons of glioma classification results based on GLM, SVM, CNN, and our method are shown in Fig. 6. Performance comparisons of simpler histogram analysis and inception CNN model on QSM modality in glioma classification tasks are listed in Supplementary Table 1.

Table 2 Performance metrics of the deep learning model on multi-modal MRI for glioma grading and molecular subtyping

Fig. 5 Fivefold ROC curves of the proposed method combining three MRI modalities for four glioma classification tasks (a for OGG/GBM, b for LGG/HGG, c for IDH1 (±), and d for ATRX (±)). The shadow range represents the standard deviation of fivefold

Fig. 6 Glioma classification results based on different algorithms (GLM, SVM, and CNN (three-layer convolution)) and our method using three modalities concatenated in the channel dimension (a for OGG/GBM, b for LGG/HGG, c for IDH1 (±), and d for ATRX (±))

Discussion

QSM has shown a high degree of reproducibility on one scanner, and across different vendors, field strengths, and sites (Deh et al. 2015; Wang et al. 2017a, b), although its accuracy can be affected by spatial resolution, echo time, tissue orientation, and other factors (Karsa et al. 2019; Lancione et al. 2017, 2019; Li et al. 2012). QSM has been widely used in brain diseases involving susceptibility variations (Bandt et al. 2019; Wang et al. 2017a, b). Radiomics or ML-based studies using QSM have been applied to the diagnosis of Parkinson’s disease and early Alzheimer’s disease (Kim et al. 2020; Li et al. 2019). To the best of our knowledge, no ML-assisted QSM studies on glioma grading and molecular subtyping have been reported to date. Recently, DL-based methods have been widely used in medical image classification. Consistency loss is generally utilized in semi-supervised learning. In our research, although the binary masks over tumor regions could not directly provide the result of tumor classification, they can be used as approximate guidance for classification. Thus, the inception module with consistency loss could enhance the ability to extract informative features.

In this study, we extracted features of all ROIs from MR images automatically and assigned them to the final linear classifiers for glioma grading, IDH1(+), and ATRX(−) prediction. DL-assisted T1WI + C and T2 FLAIR exhibited a diagnostic accuracy of < 80% in glioma grading, which is similar to the published computer-aided performance and not ideal for clinical practice (Hsieh et al. 2017). DL-assisted QSM was superior to conventional MRI modalities in distinguishing OGG from GBM, but not as good as T1WI + C in differentiating LGG from HGG. This is likely because GBM is characterized by microvascular proliferation and necrosis and is more prone to hemorrhage than OGG (Ferris et al. 2017), and the resulting susceptibility changes can be easily detected by QSM (Zhang et al. 2019). In addition, grade II and III oligodendrogliomas both feature extensive calcifications and a branching network of delicate capillaries with a “chicken wire” appearance (Wesseling et al. 2015), which makes it harder for QSM to differentiate between grade II and III gliomas. After adding QSM to conventional MRI modalities, the accuracy of glioma grading could be improved to 85%–90%, and the AUC was 0.91 in identifying OGG from GBM. As OGGs share many similar histologic and genomic characteristics and therapeutic strategies, which are quite different from those of GBMs (Reifenberger et al. 2017), distinguishing OGG from GBM is critical, and DL-assisted QSM shows great clinical significance.

Single-modal T1WI + C and T2 FLAIR showed limited value in determining the molecular subtypes of gliomas. A previous study has shown that IDH(+) gliomas have lower levels of hypoxia-inducible-factor 1-alpha and decreased angiogenesis and vasculogenesis in comparison with IDH(−) ones through the 2-hydroxyglutarate-mediated prolyl hydroxylase enzymes (EGLN, also called PHD) inhibition (Kickingereder et al. 2015). The resulting magnetic susceptibility differences between gliomas with IDH(+) and IDH(−) make it possible for QSM to distinguish between molecular subtypes of IDH1. When conventional T1WI + C and T2 FLAIR were combined with QSM modality during DL, satisfactory efficiency with AUC of 0.88 and accuracy of 0.89 could be achieved in predicting IDH1(+). DL-assisted QSM showed greater potential for both the IDH1 and ATRX tasks than conventional MRI. ATRX(−) is associated with alternative lengthening of telomeres, which can promote cellular immortality (Venneti and Huse 2015). In diagnosing ATRX(−), three MRI modalities together displayed a moderate accuracy with AUC of 0.78. In the four glioma classification tasks, QSM modality features were superior to demographic characteristics (age and sex), and QSM features were the most important variables for the classification of OGG/HGG, IDH1(+) or IDH1(−), ATRX(−) or ATRX(+). Compared with GLM, SVM, and standard CNN, our proposed inception CNN performed best in the three-modality fusion classification tasks.

Our previous study has shown the potential of multi-parametric MR radiomic features in predicting IDH1 and ATRX subtypes in LGGs (Ren et al. 2019). In recent years, many ML-based studies have been reported to be effective in determining glioma grades and IDH and 1p/19q status by multi-modal MRI (Lu et al. 2018; Sengupta et al. 2019), but the contribution of intratumoral hemorrhage and calcification has not yet been studied. One study showed the potential of deep learning based on susceptibility-weighted imaging in detecting cerebral microbleeds (Liu et al. 2019). Considering the susceptibility variations of gliomas, the present study explored the highly reproducible QSM technique in combination with an inception CNN, which revealed great potential for DL-assisted QSM in glioma grading and molecular subtyping.

This study had some limitations. First, due to the small sample size, the IDH1 prediction task was not performed in OGGs and GBMs separately, and the DL results need to be externally validated in larger study samples. Second, patients with HGG who were unable to tolerate long MR scans were underrepresented during patient enrollment. Third, volume information was not considered when delineating the tumor ROIs. In future studies, we plan to include multi-center QSM and perfusion MRI data to predict the molecular subtypes of OGG and GBM separately.

Conclusion

In conclusion, compared to conventional MRI modalities, DL-assisted QSM shows great advantages in distinguishing OGG from GBM, and predicting IDH1 and ATRX subtypes.