Introduction

High-grade glioma (World Health Organization (WHO) grade III and IV) and brain metastasis are common malignant brain tumors in adults [1]. Accurate noninvasive differentiation of the two malignancies is crucial for the clinical management and treatment of patients [2, 3]. Combining clinical history with CT/MRI can improve the diagnostic accuracy for primary and metastatic tumors. However, given their similar radiological appearances (necrotic centers, irregular enhancing margins, and peritumoral edema), high-grade glioma and brain metastasis cannot be reliably differentiated [4].

Radiomics is an emerging method that uses intensity, shape, and texture information in medical images, invisible to the naked eye, to characterize tumors [5]. It follows a set of standard steps: (1) image acquisition and segmentation, (2) feature extraction and selection, and (3) modeling and validation [6]. However, the quality, applicability, and diagnostic performance of the method are influenced by methodological choices at each step, such as the type of feature [7] and the placement of the region of interest (ROI) [8]. At present, there is no consensus on the most accurate or appropriate radiomics model for differentiating high-grade glioma from brain metastasis.

In the present study, we systematically analyzed existing studies of radiomics models used to differentiate high-grade glioma from brain metastasis. We aimed to identify the strategies that optimize the diagnostic performance and clinical applicability of radiomics methods. The pooled diagnostic performance of radiomics was estimated in a quantitative statistical meta-analysis.

Materials and methods

This systematic review was performed according to the Preferred Reporting Items for Systematic Reviews and Meta-analysis for Diagnostic Test Accuracy (PRISMA-DTA) guidelines [9]. Our protocol is available at PROSPERO (CRD42021269218). Literature search, data extraction, and quality assessment were performed independently by two reviewers (Y. Li and Y. Liu).

Literature search

Relevant articles were systematically searched in PubMed, EMBASE, and the Cochrane Library. Only original English-language articles published up to 26 July 2021 were considered. To avoid erroneous omissions, all articles on the differentiation of high-grade glioma from brain metastasis using MR imaging were retrieved, and those using radiomics were then selected manually. The search terms were ((glioblastoma OR glioma) AND (brain metastases OR brain metastasis OR metastatic brain tumor)) AND (magnetic resonance OR magnetic resonance imaging).

Literature selection

Inclusion criteria

To be included, an article had to fulfill the following criteria: (1) population: patients with high-grade glioma or brain metastasis; (2) index test: quantitative analysis of MR images using a radiomics method; (3) outcomes: differentiation of glioma from brain metastasis, with data sufficient to reconstruct the 2 × 2 sensitivity and specificity tables; (4) original articles.
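Where studies reported only sensitivity and specificity, the 2 × 2 counts can be recovered from the group sizes. A minimal sketch in Python (the function name and rounding convention are our own, not taken from any included study):

```python
def reconstruct_2x2(sensitivity, specificity, n_pos, n_neg):
    """Rebuild a 2 x 2 confusion table from reported accuracy figures.

    n_pos: number of patients with the target condition (e.g., glioma);
    n_neg: number without it (e.g., brain metastasis).
    """
    tp = round(sensitivity * n_pos)   # true positives
    fn = n_pos - tp                   # false negatives
    tn = round(specificity * n_neg)   # true negatives
    fp = n_neg - tn                   # false positives
    return tp, fp, fn, tn

# e.g., sensitivity 0.84 and specificity 0.84 in a 50/50 cohort
# -> (42, 8, 8, 42)
```

Rounding to whole patients is unavoidable here, which is why directly reported counts were preferred whenever available.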

Exclusion criteria

Reviews, guidelines, conference proceedings, case reports/series, articles including low-grade gliomas, papers reporting fewer than 10 patients, and those with overlapping patient populations were excluded from the qualitative systematic review. For articles with overlapping patient cohorts, both were included only if their methodologies differed substantially.

For the quantitative statistical analyses, studies with suspected data leakage (i.e., those that did not explicitly mention using independent validation/testing, or that used the entire dataset for feature selection) were further excluded to avoid potential overestimation of diagnostic performance.

Data extraction and quality assessment

The following data were extracted from the included studies: (1) patient characteristics: overall number of cases, numbers of patients with glioma or brain metastasis, WHO grade of glioma, mean age, age range, sex ratio, and nature of the tumor (primary or metastatic); (2) study characteristics: authors, institution, duration of patient recruitment, study design (prospective or retrospective), method of patient recruitment (consecutive or non-consecutive), reference standard, and interval between MRI and the reference standard; (3) MRI characteristics: magnetic field strength, vendor, scanner model, and sequence; (4) radiomics protocols: MRI sequence, ROI/VOI placement, segmentation, feature selection, and artificial intelligence (AI) classifier (if available); and (5) outcomes: true positives, false positives, false negatives, and true negatives. If these counts were not provided, they were calculated from the sensitivity and specificity reported in the original article. For articles with multiple sets of experiments (i.e., multiple comparisons), all of the presented groups were included.

The risk of bias and concerns about the applicability of the included studies were assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool [10]. The tool has four domains: “patient selection,” “index test,” “reference standard,” and “flow and timing,” each with sub-sections. Disagreements were arbitrated by a third reviewer (R. Wei).

The methodological quality of the radiomics workflow in the included studies was assessed using the Radiomics Quality Score (RQS) system [6] across the following domains: image acquisition and segmentation, feature extraction and selection, modeling, model validation, and data sharing. The total score, ranging from −8 to 36, is the sum of 16 items (Table 3). In the domains of feature selection and validation, we additionally assessed when feature selection was applied in the modeling process and how the datasets were allocated for modeling, to identify possible data leakage. Since the RQS comprises several ordinal questions, inter-rater agreement for individual RQS items was calculated using the Fleiss kappa statistic for ordinal variables [11]. Following Cohen's suggested interpretation, kappa values ≤ 0 indicate no agreement, 0.01–0.20 none to slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. Inter-rater agreement for the total RQS was assessed using the intraclass correlation coefficient (ICC).
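For reference, the Fleiss kappa used for item-level agreement can be computed as follows. This pure-Python sketch is our own (not the implementation in the R package used for the analysis) and takes a subjects × categories matrix of rating counts:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] = number of raters who assigned subject i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)                     # number of subjects
    n = sum(counts[0])                  # raters per subject
    k = len(counts[0])                  # number of categories
    # chance agreement P_e from the overall category proportions p_j
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    # mean observed per-subject agreement P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    return (P_bar - P_e) / (1 - P_e)

# two raters, four subjects, two categories; kappa = 7/15, i.e., "moderate"
ratings = [[2, 0], [2, 0], [0, 2], [1, 1]]
```

Perfect agreement on every subject yields kappa = 1, provided the subjects are not all assigned to a single category (which would make chance agreement equal to 1).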

Statistical analysis

Inter-rater agreement for the RQS was calculated using the “irr” package in R version 4.1.1. Meta-analysis was performed using Meta-DiSc 1.4 and the “midas” module in Stata 16. Most of the included articles used machine learning classifiers and reported results for multiple cohorts, such as training, testing, and validation sets; we only employed the generalizable data from the validation/testing set. Pooled sensitivity and specificity for homogeneous data were determined using a fixed-effect model. Furthermore, the diagnostic performance of different radiomics protocols was compared along the following axes: (1) T1WI, T2WI, and T1CE only vs. diffusion MRI; (2) inclusion vs. exclusion of the peritumoral area; (3) feature extraction from ROI vs. VOI; (4) segmentation method (manual vs. (semi-)automatic); (5) presence vs. absence of an AI classifier. Because of the possible risk of data leakage, the pooled sensitivity/specificity of the excluded studies using diffusion MRI [7, 8, 12,13,14] or lacking AI classifiers [7, 8, 12, 15, 16] is presented for reference purposes only.
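The fixed-effect pooling step can be illustrated with a simple inverse-variance sketch (our own simplification; the exact computation in Meta-DiSc may differ, and boundary proportions of 0 or 1 would need a continuity correction before weighting):

```python
import math

def pooled_proportion(events, totals):
    """Fixed-effect (inverse-variance) pooled proportion with a 95% CI.

    For sensitivity, pass per-study TP counts and TP + FN totals;
    for specificity, TN counts and TN + FP totals.
    """
    weights, estimates = [], []
    for e, t in zip(events, totals):
        p = e / t
        var = p * (1 - p) / t              # binomial variance of the proportion
        weights.append(1.0 / var)
        estimates.append(p)
    pooled = sum(w * p for w, p in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))     # standard error of the pooled estimate
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se

# two studies that each observe sensitivity 0.80 pool to 0.80
sens, lo, hi = pooled_proportion([80, 40], [100, 50])
```

Larger studies receive proportionally larger weights, so the pooled estimate is dominated by the most precise studies.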

Heterogeneity across the included studies was assessed using the Cochrane Q-test (p value ≤ 0.05) and the I2 statistic (> 50%) [17]. A Spearman correlation coefficient > 0.6 was considered to indicate a considerable threshold effect [18]. Publication bias was analyzed using a funnel plot.
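Cochran's Q and I2 can be sketched from per-study effect estimates and their variances (a minimal illustration; the helper name is our own):

```python
def heterogeneity(estimates, variances):
    """Cochran's Q statistic and I^2 (%) for per-study effect estimates."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Q: weighted sum of squared deviations from the pooled estimate
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1                       # degrees of freedom
    # I^2: excess of Q over its expectation under homogeneity, floored at 0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i2

# identical study estimates: no heterogeneity (Q = 0, I^2 = 0%)
q, i2 = heterogeneity([0.5, 0.5, 0.5], [0.001, 0.002, 0.001])
```

Under homogeneity Q follows approximately a chi-squared distribution with df degrees of freedom, which is what the reported p values refer to.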

Results

Literature search

The detailed literature search and selection process is shown in Fig. 1. The initial search identified 1,456 articles, of which 390 were duplicates. Screening the titles and abstracts of the remaining 1,066 removed a further 1,045 articles. After reading the remaining 21 full texts, five more articles were removed for the following reasons: (1) two studies contained data insufficient to generate the 2 × 2 tables [19, 20]; (2) one study did not compare glioma with metastasis [21]; (3) one study reported an overlapping cohort [22]; (4) one study included low-grade glioma [23]. In one article [24], a deep learning (DL) model was independently validated; considering its difference from conventional machine learning (cML), this article was treated as two distinct studies (named Bae (cML) and Bae (DL)). The qualitative systematic review therefore included 17 studies with a combined total of 1,717 patients [4, 7, 8, 12,13,14,15,16, 24,25,26,27,28,29,30,31]. Seven of them were suspected of possible data leakage and were excluded from the quantitative statistical analysis (Supplementary Table S1) [7, 8, 12,13,14,15,16]. Ultimately, the quantitative statistical analysis was based on the remaining 10 studies [4, 24,25,26,27,28,29,30,31].

Fig. 1 A flowchart showing the selection process for included studies

Characteristics of the included studies

The characteristics of the included studies are shown in Table 1. The radiomics protocols are shown in Table 2, and their proportions are presented in Fig. 2C–G. Eight studies investigated only T1WI, T2WI, or T1CE images [4, 15, 24,25,26, 28, 29], five examined diffusion MRI [7, 8, 12,13,14], and four combined multiple MRI techniques [16, 27, 30, 31]. In addition, 10 studies investigated only the tumor area [4, 8, 13, 14, 16, 25, 26, 28, 30, 31], two investigated only the peritumoral area [12, 15], and five investigated both [7, 24, 27, 29]. Semi-automatic or automatic segmentation was applied in seven studies [4, 7, 13,14,15, 24, 27], whereas manual segmentation was reported in 10 studies [8, 12, 16, 25, 26, 28,29,30,31]. Feature selection was performed in nine studies [4, 8, 13, 15, 24,25,26,27,28] to reduce the risk of over-fitting [6]. An AI classifier was reported in 12 studies [4, 13, 14, 24,25,26,27,28,29,30,31]. The details of the radiomics protocols are shown in Supplementary Table S2.

Table 1 Characteristics of the included studies. The Bae et al cML and DL entries share a row because they are based on the same patient cohort. Studies in boldface are included in the quantitative statistical analysis
Table 2 Radiomics protocols that yielded the highest AUC in each study. For studies that did not provide the results of feature selection (marked with *), all extracted features are listed instead of the subset actually used in the model. Studies in boldface are included in the quantitative statistical analysis
Fig. 2 Visual summary of the studies. A, Proportion of validation/testing. B, Proportion of potential sources of data leakage. C, Proportion of machine-learning methods used. D, Proportion of MRI modalities used. E, Proportion of segmentation methods. F, Proportion of ROI and VOI. G, Proportion of investigated areas. H, Diagnostic performance of different radiomics strategies; sensitivities/specificities marked with * come from studies with suspected data leakage. AI: artificial intelligence, DNN: deep neural network, LASSO: Least Absolute Shrinkage and Selection Operator, MLP: multi-layer perceptron, NNW: neural network, RF: random forest, ROI: region of interest, SVM: support vector machine, T1CE: T1 contrast-enhanced, T1WI: T1-weighted image, T2WI: T2-weighted image, VOI: volume of interest

Quality assessment

QUADAS-2

In the patient selection domain, all studies used a case-control design, and the exclusion criteria of two studies [4, 7] were unclear. In the index test domain, the thresholds of the test indexes were not pre-specified in 11 studies [7, 8, 12,13,14,15,16, 26, 27, 30, 31]. Regarding the reference standard, two studies were unclear [7, 27], and the time interval between MRI and the reference standard was not reported in 14 studies. There were no applicability concerns under the QUADAS-2 tool except for the two studies that did not report their reference standards. The most common sources of risk of bias identified in the included studies are summarized in Fig. 3. The initial assessments of the two reviewers are presented in Supplementary Table S3.

Fig. 3 Causes of risk of bias, classified and colored according to the domains of the QUADAS-2 tool. Bar length represents the frequency with which each cause was identified in the included studies

RQS

The RQS results of the included studies were as follows: mean ± standard deviation 5.13 ± 5.82, median 4.25, with total scores ranging from −3 to 15 (0–42% of the maximum). The average rating and inter-rater agreement for each item are shown in Table 3, and Table S4 presents the individual and summarized scores for each study. Most studies performed multiple segmentation, feature selection, discrimination statistics, and biological correlation, and explored potential clinical utility. In contrast, no study examined the stability of radiomics features across different scanners or time points, and none performed calibration statistics or evaluated the cost-effectiveness of their protocols. Clinical features were modeled in only a single study [4]. The MRI protocol was provided in most studies, the segmentation method in four studies [4, 13, 14, 27], and code (on GitHub) in only one study [25].

Table 3 Ratings of individual RQS items of the included studies and Fleiss’s kappa values for the two raters

The proportions of validation/testing approaches and potential sources of data leakage are shown in Fig. 2A and B. Six studies used independent validation/testing sets, two internal [4, 28] and four external [24, 25, 29], and reported that only training set data were used before validation/testing to avoid potential data leakage. Six studies employed cross-validation [13, 14, 26, 27, 30, 31]: three declared that feature selection was performed only within each fold of the cross-validation [26, 27, 30], two used the entire dataset for feature selection [13, 14], and one did not perform feature selection. Five studies reported no validation/testing [7, 8, 12, 15, 16]. Studies that did not perform validation/testing [7, 8, 12, 15, 16], as well as those that used the entire dataset for feature selection [13, 14], were excluded from the quantitative statistics because of suspected data leakage. The detailed criteria for flagging suspected data leakage are shown in Supplementary Table S1.

Substantial or almost perfect inter-rater agreement was achieved for all individual items of the RQS. The ICC for the total RQS was 0.98 (95% CI 0.93–0.99).

Heterogeneity and diagnostic accuracy

Based on Cochran's Q tests (Q = 13.81 for sensitivity and Q = 3.10 for specificity, both p > 0.05) and I2 statistics (I2 = 34.84% for sensitivity and I2 = 0.00% for specificity), the studies were homogeneous. The Spearman correlation coefficient was 0.097 (p = 0.789), indicating no threshold effect. The pooled sensitivity and specificity of radiomics across the 10 studies were 84% (95% CI, 80–88%) and 84% (95% CI, 81–87%), respectively (Fig. 4). Deeks' funnel plot revealed a low likelihood of publication bias (p = 0.08; Fig. 5).

Fig. 4 Coupled forest plots of pooled sensitivity and specificity

Fig. 5 Deeks' funnel plot. The p-value of 0.08 suggests that the likelihood of publication bias is low

As shown in Table 4 and Fig. 2H, higher sensitivity and specificity were associated with (1) the use of DWI or DTI [7, 8, 12,13,14], (2) (semi-)automatic segmentation [4, 24, 27], and (3) features extracted from a VOI [25, 27, 29, 30]. Including the peritumoral area [24, 27, 29] increased the sensitivity of the model but not the specificity, whereas applying an AI classifier [4, 24,25,26,27,28,29,30,31] improved the specificity but slightly decreased the sensitivity.

Table 4 Diagnostic performance of studies grouped by radiomics protocol. The sensitivities/specificities of the groups that investigated only diffusion MRI or did not use AI classifiers come from studies with suspected data leakage

Discussion

In the present study, we assessed the diagnostic performance of radiomics in differentiating high-grade glioma from brain metastasis by systematically reviewing existing studies. To the best of our knowledge, the only three systematic reviews on this topic evaluated the diagnostic accuracy of manual assessment of diffusion or perfusion MRI [32,33,34]. We found that the pooled sensitivity and specificity of the radiomics models were 84% (95% CI, 80–88%) and 84% (95% CI, 81–87%), respectively. In particular, the diagnostic performance of radiomics applied to DWI or DTI [7, 8, 12,13,14] (pooled sensitivity and specificity of 89% and 90%) was superior to that of manual review of DWI or DTI (pooled sensitivity and specificity of 80% and 81%, respectively) evaluated in another meta-analysis [32].

Radiomics has been in use for almost a decade, and despite its significant advantages, its clinical application still faces numerous limitations [35, 36]. Based on the findings of this study, radiomics has the unique advantage of quantitatively combining clinical information and medical imaging; however, underutilization of this capability hinders its clinical application. The US National Institutes of Health (NIH) describes radiomics features as a type of “marker” or “characteristic” that is objectively measured and evaluated to reflect the state of a biological or pathogenic process or pharmacologic responses to a therapeutic intervention [37]. Clinical data such as age, sex, and exposure to ionizing radiation are closely associated with the pathogenesis of glioma and should therefore not be ignored in a diagnostic model [38]. One included study found that clinical features carried a high weight in its radiomics model [4]. Halligan et al proposed that clinical imaging biomarkers should be incorporated into a multivariate model to evaluate their additive contribution to the overall outcome [38]. Similarly, one study found that clinical features were superior to radiomics in predicting the invasiveness of lung adenocarcinoma (78.4% vs 73.0%), whereas the combination of the two outperformed either alone (83.8%). For instance, age is a risk factor for the development of high-grade glioma, but its contribution to the development of brain metastasis is not uniform [39, 40]. Age and other numerical clinical data can be readily quantified in radiomics models (e.g., age as a variable in an algorithm or function), whereas radiologists can only make qualitative judgments based on a broad “age group.” Therefore, future radiomics analyses should incorporate clinical features to build more reliable models, or add radiomics features to existing diagnostic models to verify their true diagnostic contribution.

The lack of standardized quality control and reporting throughout the workflow also limits the application of radiomics. At each step of a radiomics study, it is essential to keep the validation/testing data completely independent and invisible before validation/testing in order to obtain a generalizable predictive model [41]. Studies that used the entire dataset for both model training and evaluation, without any validation/testing [7, 8, 12, 15, 16], carried potential bias because they could not demonstrate the replicability of the model on unseen data. Two studies [13, 14] performed feature selection on the entire dataset before cross-validation rather than within each fold; under such circumstances, the features included in the model are determined by the entire dataset (including the portion used for validation/testing), resulting in overestimation of diagnostic performance [41]. In addition, non-standard and incomplete reporting, such as omission of the specific features selected, made it impossible to validate the resulting models. Detailed and accurate reporting is necessary for a method to be applied and reproduced. In light of this, we recommend the development of guidelines or checklists for this approach, such as the checklist for artificial intelligence in medical imaging (CLAIM) [42] or the rules for predictive modeling of individual differences in neuroimaging [41].
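The leak-free ordering, selecting features inside each fold on training data only, can be sketched in pure Python (a toy univariate selector and nearest-mean classifier of our own; the included studies used methods such as LASSO or SVM):

```python
def select_best_feature(X, y):
    """Pick the feature whose class means differ most -- on the data given."""
    def mean(vals):
        return sum(vals) / len(vals)
    scores = []
    for j in range(len(X[0])):
        m0 = mean([x[j] for x, label in zip(X, y) if label == 0])
        m1 = mean([x[j] for x, label in zip(X, y) if label == 1])
        scores.append(abs(m1 - m0))
    return scores.index(max(scores))

def leak_free_cv_accuracy(X, y, k=5):
    """k-fold cross-validation with feature selection INSIDE each fold."""
    n = len(X)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))                  # interleaved folds
        Xtr = [X[i] for i in range(n) if i not in test_idx]
        ytr = [y[i] for i in range(n) if i not in test_idx]
        j = select_best_feature(Xtr, ytr)                  # training data only
        m0 = sum(x[j] for x, l in zip(Xtr, ytr) if l == 0) / ytr.count(0)
        m1 = sum(x[j] for x, l in zip(Xtr, ytr) if l == 1) / ytr.count(1)
        for i in test_idx:                                 # nearest-mean prediction
            pred = 0 if abs(X[i][j] - m0) <= abs(X[i][j] - m1) else 1
            correct += pred == y[i]
    return correct / n

# toy cohort: feature 0 separates the classes, feature 1 is uninformative
X = [[0.1, 0.5], [0.2, 0.5], [0.15, 0.5], [0.25, 0.5], [0.3, 0.5],
     [0.7, 0.5], [0.8, 0.5], [0.75, 0.5], [0.85, 0.5], [0.9, 0.5]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Calling select_best_feature on the full dataset before the loop would let the held-out samples influence the model, which is precisely the leakage pattern described above.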

The diagnostic performance of the radiomics models varied with the strategies employed. First, models performed better when using diffusion MRI, which provides functional information on white matter invasion [43, 44] that radiomics can capture. Second, features from the peritumoral area can be discriminative because the peritumoral edema of metastases is considered “pure vasogenic,” caused by parenchymal compression, whereas the edema of high-grade glioma is caused by infiltration of malignant cells [45, 46]; peritumoral features therefore provide more comprehensive information for the radiomics model. Third, although including multiple tumor slices (i.e., a VOI rather than a single-slice ROI) yields more data, it also increases the risk of over-fitting, underscoring the need for a robust feature selection algorithm. Fourth, automatic segmentation techniques can delineate tumors based on MRI-derived biomarkers, which may facilitate building the radiomics model [47]. Finally, although AI classifiers did not show prominent diagnostic gains in our evaluation, they remain a clear direction for future research and development. Most of the included studies employed machine learning or deep learning for feature selection [24,25,26, 28] or classification [4, 13, 14, 24,25,26,27,28,29,30,31], but the best-performing classifier differed between studies. To date, there is no universal classifier, because classifier performance depends on the characteristics of the sample [48], and identifying a uniform, robust classifier for a specific medical problem remains challenging [49].

This study has several limitations. First, all the studies included in the meta-analysis were retrospective in design and therefore subject to inherent selection bias and data loss; prospective multicenter trials are needed to validate our findings. Second, there was no heterogeneity among the included studies, which limited the statistical significance of comparisons between different radiomics protocols. Third, given the small sample size, the effects of combinations of different parameters at each step of the radiomics workflow were not explored. Fourth, the type of radiomics feature was not included as a variable in our comparison because of its heterogeneity across studies. Fifth, the diagnostic performance of radiomics based on diffusion MRI or without an AI classifier was obtained from studies with possible data leakage and is therefore of limited reliability and generalizability.

In conclusion, our meta-analysis demonstrated the potential of radiomics in differentiating high-grade glioma from brain metastasis and explored how the performance of a radiomics model can be improved. However, more studies are needed to further improve the applicability and reliability of radiomics.