Introduction

Metastasis (MET) and glioblastoma (GBM) are common malignant brain tumours in adults [1]. GBM accounts for more than half of all primary malignant central nervous system tumours [2]. Brain MET is estimated to be at least 10 times more common than primary brain malignancy [3]. The ability to differentiate brain MET from GBM is important, because medical staging, surgical planning, and therapeutic decisions are different for the two tumours [4,5,6]. GBM cases usually do not require a systemic examination, as spread outside of the central nervous system is rather rare; however, for cases suspected of brain MET without a previous history of systemic cancer, finding the site of the primary carcinoma and performing comprehensive systemic staging are important before any surgical intervention or medical therapy [4]. Though biopsy may be used to differentiate the two tumours, a noninvasive method is preferable, and it is sometimes mandatory when a biopsy is impossible, such as when the tumour is close to or involves an eloquent area or when the patient is too weak to undergo surgery [7].

MR imaging is an important modality for evaluating brain tumours. In patients with a history of systemic cancer and multiple lesions, differentiation of brain MET from GBM may be easily performed using conventional MRI. However, single metastases were estimated to occur in more than 25% of cases of brain MET [4, 8, 9]. Additionally, approximately 3% of high-grade glioma cases display systemic malignancy [10], and up to 20% of GBM cases were multifocal lesions in some reports [11]. Furthermore, as both brain METs and GBM can present with contrast enhancing and necrotic areas, they often present a similar anatomic MR imaging appearance [4].

It has been shown that GBMs are infiltrative lesions that invade the surrounding regions while METs are not infiltrative [12]. This may lead to some differences between the peri-enhancing oedema regions of the two tumours, primarily in cells, oedema type, and angiogenesis [13]. To find these differences, some advanced MR imaging techniques, such as dynamic susceptibility contrast (DSC) and dynamic contrast-enhanced (DCE) imaging, are required [12]. With conventional MR imaging, however, it is still difficult to detect these differences and to differentiate between the two tumours [12, 14].

Radiomics is an approach to extract high-throughput data from images, and the data can then be mined for improved decision support [15]. Previous studies showed that radiomics offers important advantages for the assessment of underlying tumour pathophysiology and improves the ability to distinguish between tumours [15, 16]. A radiomic study usually follows a workflow of four steps, and building and evaluating a mathematical model or classifier is considered the last step in the process [17, 18]. However, sometimes the model or classifier is not satisfactory. Though multiple classifiers were built in many previous radiomic studies [12, 19, 20], the main purpose was to find the optimal classifier. Because each classifier may have its own advantages or disadvantages due to the different algorithms [21, 22], we hypothesised that the combined use of multiple classifiers, like consultation among specialists of a multi-disciplinary team (MDT), might bring about extra benefits. Thus, the present study sought to differentiate supratentorial single brain MET from GBM by using radiomic features derived from the peri-enhancing oedema region on conventional MR images and multiple classifiers, and especially to explore the value of the combined use of the classifiers.

Methods

Patients

This retrospective study was approved by the Local Ethics Committee of our hospital, and the requirement for patient informed consent was waived. One hundred and twenty patients from four MRI scanners (15 GBMs and 15 METs for each scanner) with single and supratentorial tumours (brain MET, n = 60, June 2013 to May 2019; GBM, n = 60, March 2013 to May 2019) were studied at our hospital. All included patients underwent preoperative MRI scanning, including T1-weighted, T2-weighted, and contrast-enhanced T1-weighted (T1C) images, and the images were free of artefacts. Each case contained a peri-enhancing oedema region of sufficient size for measurement. Patient information was de-identified, and medical history was concealed prior to analysis. The entire data set was randomly assigned to either the training (70%) or validation (30%) data set, with sampling stratified according to the type of tumour (brain MET or GBM), using R 3.3.1 (http://www.Rproject.org).
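The stratified 70/30 assignment described above can be illustrated with a minimal sketch. The study performed this step in R; the Python function below, its name, and the random seed are illustrative assumptions and do not reproduce the authors' actual assignment.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split case indices into training and validation sets, class by class.

    Hypothetical helper: the seed and function name are assumptions,
    not taken from the study.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Apply the same training fraction within every tumour type.
        n_train = round(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        valid_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(valid_idx)


# 60 GBM and 60 MET cases, as in the study cohort.
labels = ["GBM"] * 60 + ["MET"] * 60
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))  # 84 36
```

Because the fraction is applied per tumour type, both subsets keep the 1:1 ratio of GBM to MET.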

MRI protocol

MRI was performed on 1.5T Sonata and Aera scanners (Siemens Medical Solutions) and on 3.0T Discovery MR750 and Signa HDxt scanners (GE Healthcare). Axial T1-weighted images, T2-weighted images, and T1C images were acquired with a section thickness of 6 mm (Table 1). The contrast agent gadodiamide (Omniscan, GE Healthcare) was injected through a peripheral venous catheter at a dose that was standardised based on patient body weight (0.2 ml/kg body weight, up to a maximum of 20 ml).

Table 1 The parameters for imaging

The peri-enhancing oedema region was defined as an area that was in immediate contact with the margin of the enhancing portion of the tumour and showed no enhancement on T1C images but hyperintensity on T2-weighted images [23]. Figure 1 shows a brain MET and a GBM, both of which are cystic-solid lesions, with the peri-enhancing oedema region.

Fig. 1

A MET case (a) and a GBM case (b). Both cases show cystic-solid lesions, along with peri-enhancing oedema region

Image preprocessing and segmentation

First, T1C and T1-weighted images were registered to the T2-weighted images, which served as the main images, using ITK-SNAP software (http://www.itksnap.org): automatic rigid registration with the mutual information similarity metric was performed, and the moving images were resliced into the space of the main image. Then, manual segmentation of the peri-enhancing oedema regions for all cases was performed by a radiologist (F.D., with 8 years of experience) on T2-weighted, T1C, and T1-weighted images. The segmentation was repeated by another radiologist (Q.L., with 5 years of experience) in twenty cases (brain MET, n = 10; GBM, n = 10) that were randomly selected from the included patients. Regions of interest (ROIs) were drawn approximately 3 mm away from the outer margin of the enhancing portion of the tumour to avoid a transverse partial volume effect [23]. Figure 2 shows the workflow of this study.

Fig. 2

Workflow of this study. a ROI delineation; b feature extraction; c feature selection; d classifier building; e combined use of the classifiers

Feature extraction

Consistent with the Imaging Biomarker Standardization Initiative (IBSI) [24, 25], 14 shape features, 18 first-order features, 24 grey-level co-occurrence matrix (GLCM) features, 14 grey-level dependence matrix (GLDM) features, 16 grey-level run length matrix (GLRLM) features, 16 grey-level size zone matrix (GLSZM) features, and 5 neighbouring grey tone difference matrix (NGTDM) features were extracted from each of the axial T1-weighted, T2-weighted, and T1C images by using pyradiomics (http://www.radiomics.io/pyradiomics.html) [26,27,28] (supplemental data 1 and 2). The detailed description of these features can be found on the above website and in previous literature [24, 26, 27]. A total of 321 features were ultimately extracted in this study.
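The total of 321 follows directly from the per-class counts listed above. The short tally below only restates that arithmetic; the dictionary keys are shorthand labels chosen for illustration.

```python
# Number of features per class, as listed in the text.
features_per_class = {
    "shape": 14,
    "first-order": 18,
    "GLCM": 24,
    "GLDM": 14,
    "GLRLM": 16,
    "GLSZM": 16,
    "NGTDM": 5,
}
sequences = ["T1-weighted", "T2-weighted", "T1C"]

# Features extracted from one image series.
per_sequence = sum(features_per_class.values())
# The same feature set is extracted from each of the three series.
total = per_sequence * len(sequences)
print(per_sequence, total)  # 107 321
```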

Feature selection

First, the stability of the extracted features was evaluated by the inter-observer reproducibility between the two image readers. Intraclass correlation coefficient (ICC) values were calculated for each feature in the twenty re-segmented patients. Features with an ICC value ≥ 0.90 [29, 30] were selected in this study. Then, the retained features in the entire data set were processed with the ComBat harmonisation method [31], which has been found useful for reducing the batch effect caused by different scanners [32, 33]. The patients from the same scanner were considered as a batch; therefore, there were a total of four batches, based on which the harmonisation performs a transformation for each feature [33]. Parametric empirical Bayesian adjustments were used in the process. After that, the Boruta algorithm was implemented to further select important features using the training data set. The Boruta algorithm is a wrapper built around the random forest classification algorithm; it can select all relevant features [34,35,36].
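The reproducibility filter can be sketched as follows. The text does not state which ICC form was used, so this example assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1); the function name and the feature values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a table with one row per case and one column per reader.

    Assumption: the ICC form is not specified in the study; this is one
    common choice, not necessarily the one the authors used.
    """
    n = len(ratings)       # number of cases
    k = len(ratings[0])    # number of readers
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical values of one feature measured by two readers in four cases.
feature_values = [[0.41, 0.43], [0.55, 0.54], [0.62, 0.65], [0.78, 0.77]]
icc = icc_2_1(feature_values)
keep_feature = icc >= 0.90  # threshold used in the study
print(keep_feature)
```

Under this form, a systematic offset between readers lowers the ICC even when the two sets of measurements are perfectly correlated, which makes it a stricter criterion than a correlation coefficient.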

Classifier building and combination

With the selected features and the training data set, five base classifiers were built. These included decision tree (DT), support vector machine (SVM), neural network (NN), naive Bayes (NB), and k-nearest neighbour (KNN) classifiers [22]. All five classifiers belong to the supervised learning category, but each works with a different algorithm and has its own advantages and limitations [21, 22]. Thus, the performance may differ among classifiers even when they are fed with the same data. Instead of finding the classifier with the best performance, we were much more interested in exploring the combined performance of these classifiers, which might be similar to the specialist consultation of an MDT.

Thus, after the 5 base classifiers were built, a same-weight, simple majority vote method [37] was first used to assess the combined performance of the base classifiers. During this process, each classifier was regarded as a specialist and given the same weight in the diagnosis. The final diagnosis was made according to the simple majority rule [37, 38]. For example, with 5 classifiers in all, if a case was diagnosed as GBM by 3 classifiers and as MET by 2 classifiers (3A pattern), then the final diagnosis was GBM. In this study, three agreement patterns were noted: all 5 classifiers reached agreement (5A pattern), 4 classifiers reached agreement (4A pattern), and 3 classifiers reached agreement (3A pattern).
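The voting rule and the agreement patterns can be expressed in a few lines. This is a minimal sketch of the rule as described; the function name and the example votes are illustrative.

```python
from collections import Counter


def majority_vote(votes):
    """Return the majority diagnosis and the agreement pattern.

    votes: one label per base classifier. With 5 voters and 2 classes
    a tie is impossible, so the most common label is the final diagnosis.
    """
    label, n_agree = Counter(votes).most_common(1)[0]
    return label, f"{n_agree}A"


# Example from the text: 3 classifiers say GBM, 2 say MET.
votes = {"DT": "GBM", "SVM": "GBM", "NN": "MET", "NB": "GBM", "KNN": "MET"}
diagnosis, pattern = majority_vote(list(votes.values()))
print(diagnosis, pattern)  # GBM 3A
```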

To determine whether proper weights for the classifiers could further improve the overall classification performance, a logistic regression algorithm was implemented using the results of the 5 classifiers as the independent variables and the ground truth as the dependent variable.

Accuracy, sensitivity, and specificity were used to evaluate the classification performance.
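These three metrics follow from the confusion matrix. The sketch below assumes MET is treated as the positive class, which the text does not state; swapping the positive class exchanges sensitivity and specificity. The labels in the example are made up.

```python
def classification_metrics(truth, predicted, positive="MET"):
    """Accuracy, sensitivity, and specificity for a two-class problem.

    Assumption: the choice of positive class is hypothetical.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)  # proportion of positives correctly found
    specificity = tn / (tn + fp)  # proportion of negatives correctly found
    return accuracy, sensitivity, specificity


truth = ["MET", "MET", "GBM", "GBM"]
predicted = ["MET", "GBM", "GBM", "GBM"]
print(classification_metrics(truth, predicted))  # (0.75, 0.5, 1.0)
```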

Statistical analysis

Age data are presented as the mean ± standard deviation (SD) or median, depending on whether they were normally distributed. The demographic characteristics of GBMs and METs in the entire data set were compared using the Pearson chi-square test, Student’s t test, or the Mann–Whitney U test, as appropriate. All tests were two-sided, with the statistical significance level set at 0.05. The statistical analyses were performed using SPSS 19.0.

The feature processing, selection, and classifier building were mainly performed with the following R packages: ‘psych’, ‘sva’, ‘lattice’, ‘sampling’, ‘Boruta’, ‘ranger’, ‘ggplot2’, and ‘caret’. Tenfold cross-validation was performed and repeated 3 times to find the best parameters, and accuracy with the largest value was used to select the optimal model [39].
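The resampling scheme, tenfold cross-validation repeated 3 times, can be sketched as below. The study ran this step with R packages; this Python version only illustrates how the splits are formed, uses plain rather than stratified folds, and the function name and seed are assumptions.

```python
import random


def repeated_kfold(n_cases, k=10, repeats=3, seed=0):
    """Return a list of (train_indices, held_out_indices) pairs.

    Each repeat reshuffles the cases and cuts them into k folds, so every
    case is held out exactly once per repeat.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        order = list(range(n_cases))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            splits.append((train, held_out))
    return splits


# 84 cases in the training data set give 10 x 3 = 30 resamples.
splits = repeated_kfold(84)
print(len(splits))  # 30
```

For each candidate parameter setting, accuracy would be averaged over the 30 held-out folds, and the setting with the highest mean accuracy retained.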

Results

One hundred and twenty patients (63 men and 57 women) were included in the study. Ages ranged between 20 and 84 years. Brain METs originated from the lung (n = 39), digestive tract (n = 8), breast (n = 4), uterus (n = 1), ovary (n = 2), liver (n = 1), skin (n = 1), and unknown origins (n = 4). Eighty-four patients were assigned to the training data set (GBM, n = 42; MET, n = 42), and 36 patients were assigned to the validation data set (GBM, n = 18; MET, n = 18). The demographic and clinical data are presented in Tables 2 and 3.

Table 2 Results of the analysis of basic demographic and clinical data
Table 3 Results of the analysis of basic demographic and clinical data for sub-dataset

Feature selection

A total of 271 features showed ICC values ≥ 0.90, including 91 features from T1-weighted images, 89 features from T2-weighted images, and 91 features from T1C images. After the use of the Boruta algorithm, 3 features were finally selected, i.e., one feature from T2-weighted images, the original_glszm_SizeZoneNonUniformityNormalized, and two features from T1C images, original_gldm_DependenceNonUniformityNormalized and original_glrlm_RunLengthNonUniformityNormalized.

Performance of classifiers

Five base classifiers were built using the 3 selected features. The performances of the 5 classifiers differed. The classifiers showed an accuracy of 0.70 to 0.76, sensitivity of 0.57 to 0.98, and specificity of 0.43 to 0.93 for the training data set, with an accuracy of 0.56 to 0.64, sensitivity of 0.39 to 0.78, and specificity of 0.50 to 0.89 for the validation data set. The DT classifier showed the highest specificity among the base classifiers, but the lowest sensitivity. In contrast, the NN classifier achieved the highest sensitivity but showed the lowest specificity. The SVM classifier had some advantages in accuracy and sensitivity for the training data set; however, it did not work well for the validation data set. The KNN classifier exhibited the highest accuracy for the training data set, but its other performance parameters were unremarkable. The NB classifier showed an overall medium performance with respect to accuracy, sensitivity, and specificity.

When the classifiers were combined by the same weight and simple majority vote method, they showed an overall accuracy, sensitivity, and specificity of 0.79, 0.83, and 0.74, respectively, for the training data set, and 0.64, 0.50, and 0.78, respectively, for the validation data set (Fig. 3). The overall accuracy was higher than that of any base classifier in the training data set and equal to that of the best-performing base classifier in the validation data set. The 5A pattern was found in 40.5% (34/84) of cases in the training data set and 36.1% (13/36) of cases in the validation data set. The 4A pattern was found in 38.1% (32/84) of cases in the training data set and 33.3% (12/36) of cases in the validation data set. The 3A pattern was found in 21.4% (18/84) of cases in the training data set and 30.6% (11/36) of cases in the validation data set. Further analysis showed that different agreement patterns had different classification performances and brought diverse performance advantages when compared with the separate use of the classifiers (Fig. 3). The 5A pattern showed the highest accuracy, sensitivity, and specificity (0.94, 1.00, and 0.89, respectively) for cases in the training data set and the highest accuracy and specificity (0.77 and 1.00, respectively) for cases in the validation data set when compared with the other agreement patterns. Additionally, the 5A pattern achieved the highest accuracy and sensitivity in the training data set, as well as the highest accuracy and specificity in the validation data set, when compared with the base classifiers. Both accuracy and specificity in the training data set and validation data set, as well as sensitivity in the training data set, showed a downward trend from the 5A pattern to the 3A pattern.

Fig. 3

The performance of the 5 base classifiers and the combined use of the classifiers: accuracy (a), sensitivity (b), and specificity (c). DT, decision tree; SVM, support vector machines; NN, neural network; NB, naive Bayes; KNN, k-nearest neighbour; 5A, five classifiers reach agreement; 4A, four classifiers reach agreement; 3A, three classifiers reach agreement; VOT, combined use of the classifiers by same weight and simple majority vote method; LOG, combined use of the classifiers by logistic algorithm with different weights

When exploring different weights for the combined use of the base classifiers, the logistic regression algorithm showed an intercept of − 2.8364 and weights of 2.4677, 0.7768, 1.1095, 2.9146, and − 0.4776 for the DT, KNN, SVM, NN, and NB classifiers, respectively. The overall performance of the logistic regression algorithm was characterised by accuracy, sensitivity, and specificity of 0.80, 0.81, and 0.79, respectively, for the training data set and 0.64, 0.50, and 0.78, respectively, for the validation data set. It showed a comparable performance in the training data set and the same performance in the validation data set when compared with the same weight voting method overall.
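The weighted combination can be illustrated with the reported intercept and weights. The sketch assumes that each base classifier's output is coded as 0 or 1 and that a combined probability above 0.5 is read as the class coded 1; the text states neither the coding nor the cut-off, so both are assumptions, as are the function names.

```python
import math

# Intercept and weights as reported for the logistic regression.
INTERCEPT = -2.8364
WEIGHTS = {"DT": 2.4677, "KNN": 0.7768, "SVM": 1.1095, "NN": 2.9146, "NB": -0.4776}


def linear_score(outputs):
    """Weighted sum of the base classifier outputs plus the intercept.

    outputs: dict mapping classifier name to 0 or 1 (assumed coding).
    """
    return INTERCEPT + sum(WEIGHTS[name] * outputs[name] for name in WEIGHTS)


def combined_probability(outputs):
    """Logistic transform of the linear score."""
    return 1.0 / (1.0 + math.exp(-linear_score(outputs)))


# Hypothetical case: only DT and NN vote for the class coded 1.
case = {"DT": 1, "KNN": 0, "SVM": 0, "NN": 1, "NB": 0}
print(combined_probability(case) > 0.5)  # True
```

Under these assumptions, DT and NN carry the largest weights, so their agreement alone is enough to push the score above zero, whereas in the same-weight vote two classifiers out of five would be outvoted.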

Discussion

It is important to differentiate between single brain MET and GBM, and yet, it is still difficult at times to distinguish the two tumours by using a traditional interpretation of conventional MR images [12, 14]. In the present study, we built multiple classifiers using radiomic features derived from the peri-enhancing region to differentiate supratentorial single brain MET from GBM. We found that, based on the selected features, the 5 classifiers separately only showed moderate performances in differentiating the two kinds of tumours. Combined use of the classifiers could bring about extra benefits to increase the classification performance of cases, especially for those with all classifiers reaching agreement.

In our study, all the ultimately selected features were texture features. Since radiomic features may capture the heterogeneity of lesions [16, 40], as well as the intra-tumoural heterogeneity that usually reflects the variations in blood flow, oedema, and necrosis [36], we speculate that these features had some relationship with the heterogeneity in the peri-enhancing oedema regions for the two tumours.

It is still a challenge to interpret the relationship between radiomic features and the outcome variables [36]. From the selected features, we at least know that some features on T1C images and T2-weighted images in the peri-enhancing oedema region are relevant for differentiating between brain MET and GBM, and this may provide some guidance for finding visible and comprehensible features in the future. This may indicate that radiomic feature selection is a useful way to narrow the scope of feature exploration for human interpretation, especially for unfamiliar diseases.

In this study, the accuracy of the 5 base classifiers separately was inferior to the median performance (approximately 0.75) of the previous study, which built 20 classifiers (accuracy 0.57 to 0.87) [12]. One reason for this may be the wide ROI we used for GBMs. Though the entire peri-enhancing oedema region was known as pure vasogenic oedema for brain MET, it showed a mixed pattern for GBM, within which tumour cell infiltration was mainly found to be adjacent to the enhancing region [12, 13]. However, it is difficult to define the range of cell infiltration for GBM on T1-weighted images, T2-weighted images, and T1C images.

In fact, the performance of a radiomic classifier or model might not always be satisfactory. The primary data may be an important factor affecting the classification performance. In our experience, image data that were very difficult for human interpretation might also pose a challenge for a radiomic classifier or model, which we thought may be related to the lack of effective features. Algorithms may also affect the classification results. The previous study showed that, even when fed with the same features, the Fine Gaussian SVM classifier was much less accurate than the linear SVM in differentiating brain MET from GBM [12]. For fixed data, building many classifiers and choosing the best one is a way to create a better classifier, and this approach was widely used in previous studies [12, 19, 20]. In one previous radiomic study, as many as 88 classifiers were built [20]. However, it would still be unsatisfactory if all the built classifiers exhibited poor performance.

Clinically, an MDT provides a collaboration among diverse professionals, can produce a comprehensive decision, and is very useful for making proper clinical decisions [41, 42]. Similarly, the combined use of multiple classifiers, which belongs to the stacking approach of ensemble learning [43, 44], can also produce a comprehensive result and may be a way to improve the performance of the base classifiers [37, 44, 45]. In this study, the combined use of multiple classifiers indeed brought about extra benefits. We consider that the results of the classifiers or algorithms are themselves data, a transformation of the primary data, and that further analysing these results is a way of mining the data more deeply. Thus, building classifiers may not be the last step, but may be a new start for some radiomic studies.

This study provided a method for differentiating single brain MET from GBM with conventional MRI. Though it offers only moderate performance and cannot replace pathological diagnosis by biopsy or operation, it provides evidence that, by analysing the radiomic features derived from the peri-enhancing oedema region, conventional MR imaging can also be used for differentiating single brain METs from GBMs. When compared on the same evaluation metrics, sensitivity and specificity, our results are even comparable to those of advanced MR imaging methods [46,47,48]. Moreover, conventional MR imaging is much more widely available than advanced MR imaging. Furthermore, this study provides a method for further mining the data in radiomic studies, which may bring about extra benefits. It also indicates that there may be some potential value in combining the use of classifiers built in different research centres or companies, or even in using heterogeneous data from classifiers devised worldwide.

Some limitations should be considered in the current study. First, the number of subjects included in this study was small, and the ratio of brain MET and GBM was not representative of the general population. Second, only features in the peri-enhancing oedema region were extracted, although features in other regions may also be useful and can be studied in the future. Third, only the agreement pattern for the same weight and simple majority vote method was analysed, and more complex analysis for different weights and other voting methods can be further explored in the future.

Conclusions

With three features derived from the peri-enhancing oedema region, all five classifiers separately showed moderate value in differentiating supratentorial single brain MET from GBM. However, combined use of the classifiers, like MDT consultation, could generate extra benefits for increasing the classification performance of cases, especially for those with all classifiers reaching agreement. In addition, though building classifiers is usually considered as the last step of the workflow for radiomics, it is not the end of radiomics. Further mining or using the results of the classifiers might lead to better decision support.