Introduction

Non-contrast-enhanced CT (NECT) remains the first-choice examination for patients with suspected acute intraparenchymal hematomas in the emergency room [1]. The most frequent etiologies of spontaneous hematomas are hypertension (HP) [2], rupture of intracranial aneurysms (rIAs) [3], brain arteriovenous malformation (bAVM)-related bleeding [4], hemorrhagic transformation of ischemia [5], and cerebral amyloid angiopathy (CAA) [6]. In clinical practice, early identification of AVM-related hematomas (AVM-H) is crucial, since a ruptured AVM nidus should be excised or embolized to avoid rehemorrhage. At present, hematomas caused by HP, CAA, and AVM are difficult to discriminate on NECT by the naked eye; radiologists rely on angiography to detect AVM-H, and the features of AVM-H on NECT have been poorly studied. Radiologists recommend further angiography for suspected cases if the hematoma is lobar, contains calcification, or has an indented border on NECT. However, the interpretation of hematoma location is sometimes ambiguous, and signs such as an indented border are present in only a small portion of patients. Moreover, compared with NECT, angiography is time-consuming and invasive, and it requires contrast injection and patient compliance. Therefore, new tools should be developed to extend our understanding of the features of AVM-H on NECT and to help human raters screen for them.

AVM-related hematomas are more heterogeneous in composition, since the malformed vasculature is always embedded in the hematoma. In addition, dilated veins surrounding the hematoma may indent its border. Therefore, we hypothesized that radiomics, an emerging technique for analyzing the shape and texture features of lesions, can aid diagnosis on NECT [7,8,9,10]. Radiomics features capture the shape, intensity, and texture of the lesion on the original image as well as on images transformed by filters such as wavelet and Laplacian of Gaussian (LoG). By combining feature selection (FS) methods with machine learning algorithms, predictive models can be constructed on a training dataset and then evaluated on a test dataset. This technique has been widely applied in tumor research but rarely in hemorrhagic diseases [11, 12]. We conducted this study to explore whether AVM-related hematomas have distinct radiomics features and whether these features can be used to accurately separate them from other hematoma types on NECT.

Methods

Population and data acquisition

In total, 261 cases with intraparenchymal hemorrhage were retrospectively reviewed and included in our study. Hematomas located in the cerebellum and brain stem were not included. The study was approved by the institutional ethics committee of our center, and informed consent was obtained from all patients. In the training and validation dataset (n = 180), there were 40 AVM-associated hematomas (22.2%), 92 HP-associated hematomas (51.1%), and 48 CAA-related hematomas (26.7%). In the test dataset (n = 81), the corresponding numbers were 18 (22.2%), 37 (48.1%), and 26 (29.6%). Demographics and hematoma characteristics are summarized in Table 1.

Table 1 Demographics and hematoma characteristics

NECT images were acquired within 6 h of symptom onset with a Sensation 16 CT scanner (Siemens). The tube voltage was 120 kVp and smart mAs was used. Slice thickness was 4.5 mm and the pixel spacing was 0.408 × 0.408 mm.

Feature extraction and stability evaluation

Image segmentation was performed manually on NECT by two radiologists (Yupeng Zhang and Baorui Zhang) to delineate the regions of intraparenchymal hematoma (Fig. 1). Then, a total of 576 radiomics features were extracted for each patient. Features were divided into six groups: (1) first-order statistics of hematoma intensity (n = 18), (2) shape (n = 16), (3) texture derived from the gray-level co-occurrence matrix (GLCM, n = 22), (4) texture derived from the gray-level run-length matrix (GLRLM, n = 16), (5) wavelet-based features (n = 448), and (6) Laplacian of Gaussian-filtered image features (n = 56). Features were calculated automatically using the PyRadiomics package implemented in Python [13]. Each feature was named by joining, with underscores, the image type from which the feature was extracted, the feature group, and the feature name. For example, original_glcm_Autocorrelation was a feature extracted from the original image, belonging to the GLCM group, with the feature name Autocorrelation.
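The naming convention and the group sizes given above can be checked with a few lines of Python. This is only an illustrative sketch: the helper parse_feature_name and the dictionary keys are hypothetical names introduced here and are not part of PyRadiomics or of the study code.

```python
from typing import NamedTuple

# Group sizes as reported in the text (keys are illustrative labels).
GROUP_SIZES = {
    "firstorder": 18,
    "shape": 16,
    "glcm": 22,
    "glrlm": 16,
    "wavelet": 448,
    "LoG": 56,
}


class FeatureName(NamedTuple):
    image_type: str
    feature_group: str
    feature_name: str


def parse_feature_name(full_name: str) -> FeatureName:
    """Split '<image type>_<feature group>_<feature name>' at the first two underscores."""
    image_type, feature_group, feature_name = full_name.split("_", 2)
    return FeatureName(image_type, feature_group, feature_name)


total = sum(GROUP_SIZES.values())
example = parse_feature_name("original_glcm_Autocorrelation")
wavelet = parse_feature_name("wavelet-HLL_glrlm_RunLengthNonUniformityNormalized")

print(total)
print(example)
print(wavelet.image_type)
```

The six group sizes add up to the 576 features stated in the text, and the first two name components identify the image type and feature class of each feature.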

Fig. 1

Illustration of the manual segmentation of three types of hematomas. CAA-H on NECT (a) and the region of interest (ROI) masked in blue (b). HP-H (c) and segmented ROI in green (d). (e) ROI of an AVM-H. Note that on (f), part of the dilated veins was embedded in the hematoma, resulting in an indented medial-posterior border of the hematoma

To reduce the influence of inter-rater variation in manual segmentation, we calculated the intraclass correlation coefficient (ICC) of each feature between the two radiologists, and only features with high stability (ICC > 0.8) entered the subsequent feature selection and modeling process [7].
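The stability filter can be sketched as follows. The text does not state which ICC form was used, so this example assumes ICC(2,1) (two-way random effects, absolute agreement, single rater); the function name icc_2_1 and the toy feature values are hypothetical and serve only to illustrate the ICC > 0.8 threshold.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per hematoma, each holding one
    feature value per rater. The ICC form is an assumption of this sketch.
    """
    n = len(ratings)        # number of subjects (hematomas)
    k = len(ratings[0])     # number of raters
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Toy values of one feature measured on five hematomas by two raters.
stable_feature = [[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 5.1]]
unstable_feature = [[1.0, 3.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0], [5.0, 4.0]]

features = {"stable_feature": stable_feature, "unstable_feature": unstable_feature}
kept = [name for name, values in features.items() if icc_2_1(values) > 0.8]
print(kept)
```

Only the feature on which the two raters agree closely passes the threshold; the other is discarded before feature selection.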

Feature selection of informative radiomics features

First, we performed univariate analysis for each feature, and those with p values < 0.1 were selected [11]. This step aimed only to remove irrelevant features before feature ranking; therefore, we relaxed the p value threshold to 0.1. Univariate analysis was followed by 11 filter-based feature selection (FS) methods. Filter methods calculate a relevance score for each feature, and low-scoring features are removed. Nine of them were univariate methods: Gini index (GINI), Relief (RELF), information gain (IFGN), gain ratio (GNRO), Euclidean distance (EUDT), F-ANOVA (FAOV), t test score (TSCR), Wilcoxon rank sum (WLCR), and Fisher score (FSCR). The other two were multivariate methods: mutual information (MUIF) and minimum redundancy maximum relevance (MRMR) [14, 15]. GINI, RELF, IFGN, GNRO, and EUDT were computed with the “attrEval” function of the R software package “CORElearn”. FAOV and MUIF were computed using the feature_selection module in sklearn (f_classif and mutual_info_classif). The ranking criteria of WLCR, FSCR, and TSCR followed the formulas detailed by Parmar et al [16]. The MRMR algorithm was implemented with the “pymrmr” package in Python [14]. We selected features according to their rankings within their own group rather than their rankings among all features, since this enabled a systematic description of different aspects of the hematomas and avoided selecting features from only one feature group. We increased the number of selected features in increments of six and found that selecting two features from each group gave the best predictive performance for most classifiers. If no feature in a group passed the univariate test, no feature was selected from that group. Raw feature vectors were then standardized by centering to the mean and scaling to unit variance.
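The group-wise selection and the standardization step can be sketched as below. This is a minimal illustration under stated assumptions: the relevance scores are invented, the helpers feature_group, top_k_per_group, and standardize are hypothetical names, and the population standard deviation is assumed for scaling. Any of the 11 filter methods could supply the scores.

```python
from collections import defaultdict
from statistics import mean, pstdev


def feature_group(name):
    """Map a feature name to one of the six groups described in the text."""
    image_type, feature_class, _ = name.split("_", 2)
    if image_type.startswith("wavelet"):
        return "wavelet"
    if image_type.startswith("LoG"):
        return "LoG"
    return feature_class  # firstorder, shape, glcm, or glrlm on the original image


def top_k_per_group(scores, k=2):
    """Keep the k highest-scoring features inside each feature group."""
    by_group = defaultdict(list)
    for name, score in scores.items():
        by_group[feature_group(name)].append((score, name))
    selected = []
    for group in sorted(by_group):
        ranked = sorted(by_group[group], reverse=True)
        selected.extend(name for _, name in ranked[:k])
    return selected


def standardize(values):
    """Center to the mean and scale to unit variance (population SD assumed)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


# Hypothetical relevance scores (higher = more relevant), for illustration only.
scores = {
    "original_shape_Maximum3DDiameter": 0.9,
    "original_shape_Maximum2DDiameterSlice": 0.8,
    "original_shape_Sphericity": 0.2,
    "original_firstorder_Entropy": 0.7,
    "LoG_glcm_Idm": 0.6,
    "LoG_glrlm_RunPercentage": 0.5,
    "LoG_firstorder_Mean": 0.1,
    "wavelet-LLL_glcm_Idm": 0.4,
    "wavelet-HLL_glrlm_RunLengthNonUniformityNormalized": 0.3,
}

selected = top_k_per_group(scores, k=2)
z = standardize([1.0, 2.0, 3.0])
print(selected)
print(z)
```

A group with fewer than two surviving features simply contributes what it has, which mirrors the rule that a group with no feature passing the univariate test contributes none.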

Machine learning and model performance evaluation

We applied eight supervised machine learning algorithms: neural network (NN), decision tree (Decision Tree), Adaboost classifier (Ada), naïve Bayes (NB), random forest (RF), logistic regression (LG), support vector machine (SVM), and k nearest neighbors (KNN). All classifiers were imported from scikit-learn (version 0.19.0), a machine learning library for Python (version 3.6.4) [17].

By combining the 11 FS methods with the 8 classifiers, we built 88 (11 × 8 = 88) models. Each model name joined two elements: the FS method and the classifier. For example, “GINI_KNN” was a model trained by a k nearest neighbors classifier, with radiomics features selected by the Gini index.
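The model grid and its nomenclature can be enumerated in a few lines. The FS abbreviations are those listed in the text; the classifier labels are partly assumptions ("Ada" and "KNN" follow the model names reported in this paper, while "DecisionTree" is a hypothetical label chosen for this sketch).

```python
from itertools import product

FS_METHODS = [
    "GINI", "RELF", "IFGN", "GNRO", "EUDT", "FAOV",
    "TSCR", "WLCR", "FSCR", "MUIF", "MRMR",
]
# Classifier labels; "DecisionTree" is an illustrative label, not from the paper.
CLASSIFIERS = ["NN", "DecisionTree", "Ada", "NB", "RF", "LG", "SVM", "KNN"]

# One model per (feature selection method, classifier) pair.
model_names = [f"{fs}_{clf}" for fs, clf in product(FS_METHODS, CLASSIFIERS)]

print(len(model_names))
print(model_names[:3])
```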

Each of the 88 models was trained and threefold cross-validated on the training dataset using the StratifiedKFold iterator in scikit-learn. StratifiedKFold is a variation of k-fold cross-validation that ensures each fold contains approximately the same percentage of samples of each target class as the whole training dataset. The predictive performance of the classifiers and their stability were evaluated by the area under the curve (AUC) and the relative standard deviation (RSD), respectively. RSD was defined as

$$ \mathrm{RSD}=\left(\mathrm{sd}_{\mathrm{AUC}}/\mathrm{mean}_{\mathrm{AUC}}\right)\times 100 $$

where sdAUC and meanAUC were the standard deviation and mean of the threefold cross-validated AUC values. The lower the RSD value, the more stable the predictive model. Trained models were then tested on an independent test dataset, and their classification performance was also evaluated by AUC. Models with a cross-validated AUC over 0.980 and an RSD under 1.000 were shortlisted, and the shortlisted model with the highest AUC on the test dataset was selected as the final model. Confusion matrix-derived metrics, including accuracy (ACC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), were calculated to further evaluate the selected model. FS and machine learning were conducted using R software (version 3.4.0, R Foundation for Statistical Computing, Vienna, Austria) and Python (version 3.6.1). Continuous variables were presented as median with interquartile range (IQR), and differences were compared by the Wilcoxon rank sum test. For categorical variables, Fisher’s exact test was adopted, and the results were listed as number of events followed by relative frequencies (%). A two-sided p value of < 0.05 was considered statistically significant.
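The cross-validation and RSD computation can be sketched with scikit-learn as follows. This is not the study code: the data are synthetic (generated with make_classification as a stand-in for the standardized feature matrix), the task is assumed to be binary (AVM-H versus other hematomas), and the use of shuffling, the random seeds, the default Adaboost hyperparameters, and the sample standard deviation (ddof=1) are all assumptions of this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training data (NOT the study data):
# 180 cases, 8 features, roughly 22% positives.
X, y = make_classification(
    n_samples=180, n_features=8, n_informative=4,
    weights=[0.78, 0.22], random_state=0,
)

# Threefold stratified cross-validation keeps the class ratio in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
aucs = []
for train_idx, val_idx in cv.split(X, y):
    clf = AdaBoostClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    prob_positive = clf.predict_proba(X[val_idx])[:, 1]
    aucs.append(roc_auc_score(y[val_idx], prob_positive))

aucs = np.asarray(aucs)
# RSD = (sd_AUC / mean_AUC) x 100; the sample SD is an assumption here.
rsd = aucs.std(ddof=1) / aucs.mean() * 100

print(aucs, rsd)
```

Repeating this loop for every feature selection method and classifier pair yields one mean AUC and one RSD per model, which are the two quantities used for shortlisting.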

Results

Selected stable features

Fig. 2 gives an overview of the ICCs of radiomics features between the two raters for each feature group. High stability (mean ICC > 0.8) was observed for shape features (ICC = 0.88 ± 0.14, mean ± SD), GLRLM-derived texture features (ICC = 0.83 ± 0.22), and LoG features (ICC = 0.88 ± 0.16). In contrast, GLCM-derived texture features (ICC = 0.63 ± 0.34), first-order statistics (ICC = 0.57 ± 0.37), and wavelet features (ICC = 0.46 ± 0.26) showed only fair to good stability. In total, 159 of the 576 (27.6%) extracted radiomics features showed high stability (ICC > 0.8), including 12 shape features, 7 first-order statistics features, 12 GLCM-derived texture features, 12 GLRLM-derived texture features, 49 LoG features, and 67 wavelet features.

Fig. 2

Boxplot of ICC of features extracted from 6 feature groups

Model performance evaluation

In Fig. 3a, the mean cross-validated AUC scores of the 88 models were presented as a heatmap. AUC scores ranged from 0.634 to 0.988; the median value was 0.931 (IQR 0.880–0.948). The Adaboost classifier outperformed the other classifiers, and the median AUC of the 11 models using the Adaboost classifier reached 0.969. With regard to prediction stability, the 11 Adaboost models had a median RSD of 0.675 (IQR 0.497–2.438), which was the lowest among the classifiers (Fig. 3b). Finally, we shortlisted models with favorable performance based on the criterion that the cross-validated AUC should be over 0.980 and the RSD under 1.000. In total, three models met the criteria: RELF_Ada (AUC 0.988, RSD 0.062), FAOV_Ada (AUC 0.982, RSD 0.594), and FSCR_Ada (AUC 0.982, RSD 0.594). On the independent test dataset, the AUC scores of the three selected models were 0.908, 0.908, and 0.957 for the FAOV_Ada, FSCR_Ada, and RELF_Ada models, respectively. Therefore, we chose RELF_Ada as the optimal model. We then evaluated the confusion matrix-derived classification metrics of RELF_Ada. The accuracy of classification was 0.926. Sensitivity, specificity, PPV, and NPV were 0.889, 0.937, 0.800, and 0.967, respectively. The cross-validated ROC curves, the ROC curve on the test dataset, and the normalized confusion matrix were shown in Fig. 4a–c.
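The selection rule and the reported metrics can be reproduced with simple arithmetic. In this sketch, the AUC and RSD values are those reported above, while the confusion matrix counts are reconstructed from numbers stated elsewhere in this paper (81 test cases, 18 of them AVM-H, two false negatives, and four false positives); the variable names are illustrative only.

```python
# Cross-validated AUC, RSD, and test AUC as reported in the text.
reported = {
    "RELF_Ada": {"cv_auc": 0.988, "rsd": 0.062, "test_auc": 0.957},
    "FAOV_Ada": {"cv_auc": 0.982, "rsd": 0.594, "test_auc": 0.908},
    "FSCR_Ada": {"cv_auc": 0.982, "rsd": 0.594, "test_auc": 0.908},
}

# Shortlist by the cross-validation criteria, then pick the best test AUC.
shortlist = [
    name for name, r in reported.items()
    if r["cv_auc"] > 0.980 and r["rsd"] < 1.000
]
final_model = max(shortlist, key=lambda name: reported[name]["test_auc"])

# Test-set counts reconstructed from the text (AVM-H is the positive class).
tp, fn = 16, 2    # 18 AVM-H cases, two missed
fp, tn = 4, 59    # 63 non-AVM-H cases, four flagged as AVM-H

metrics = {
    "accuracy": (tp + tn) / (tp + fn + fp + tn),
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "ppv": tp / (tp + fp),
    "npv": tn / (tn + fn),
}

print(final_model)
print({name: round(value, 3) for name, value in metrics.items()})
```

The reconstructed counts give the same accuracy, sensitivity, specificity, PPV, and NPV as reported, to three decimal places.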

Fig. 3

Heatmaps illustrating the predictive performance and stability of different combinations of feature selection methods (rows) and classification algorithms (columns). (a) Cross-validated AUC values of the 88 models on the training and validation dataset. (b) RSD values of the 88 models on the training and validation dataset

Fig. 4

a Illustration of the threefold cross-validated ROC curve of model RELF_Ada. b ROC curve of RELF_Ada on the test dataset. c Confusion matrix with normalization of RELF_Ada

To avoid bias and further test the radiomics model, we compared the performance of RELF_Ada with that of experienced and inexperienced raters on the 81 test cases. As shown in Table 2, RELF_Ada was superior to the interventional radiologists (IR) in accuracy and specificity, indicating that the IRs tended to overestimate the rate of AVM-H. In contrast, the sensitivity of the neuroradiologist (NR) was much lower than that of RELF_Ada (0.50 vs 0.89, p = 0.027), indicating that the NR tended to underestimate the rate of AVM-H.

Table 2 Comparison of prediction performance between our model and radiologists

Among the test cases, two were false negatives of the RELF_Ada model (AVM-H misclassified as other types); all three human raters also failed on both cases. The two hematomas were uniform in texture and had no special radiological characteristics, such as an indented border or calcification, to aid the diagnosis (Fig. 5a and d). On the DSA images, we recorded the size, the number of drainage veins, the largest diameter of the drainage veins, and the drainage pattern (superficial vs deep), and compared them between the 2 misclassified cases and the other 16 correctly classified AVM hematoma cases; however, there were no special findings (Fig. 5b, c, e, f).

Fig. 5

Illustration of the AVM-H cases misclassified by RELF_Ada. (a–c) NECT and lateral and anteroposterior views of DSA for case 1. (d–f) NECT and lateral and anteroposterior views of DSA for case 2. In both cases, the hematomas were uniform in texture. The niduses were composed of fine vessels and dilated veins were not present

Analysis of the selected radiomics features

In total, eight features were finally incorporated into the RELF_Ada model. The median values of these features and the comparison between hematoma groups were illustrated in Fig. 6. Four of the 8 features had smaller median values in AVM-H: age (26.50 vs 57.00, p < 0.001), original_firstorder_Entropy (−0.56 vs −0.01, p = 0.003), LoG_glcm_Idm (−0.34 vs 0.53, p < 0.001), and wavelet-LLL_glcm_Idm (−0.42 vs 0.44, p < 0.001). The other four features, namely, original_shape_Maximum2DDiameterSlice (0.38 vs −0.11, p = 0.003), original_shape_Maximum3DDiameter (0.36 vs −0.09, p < 0.001), LoG_glrlm_RunPercentage (0.69 vs −0.59, p < 0.001), and wavelet-HLL_glrlm_RunLengthNonUniformityNormalized (0.66 vs −0.49, p < 0.001), had larger medians in AVM-H than in hematomas of other etiologies (Fig. 6).

Fig. 6

Boxplot illustrating the differences between AVM-H and hematomas of other etiologies for the 8 features finally included in RELF_Ada. (a) Age. (b) original_shape_Maximum2DDiameterSlice. (c) original_shape_Maximum3DDiameter. (d) original_firstorder_Entropy. (e) LoG_glcm_Idm. (f) LoG_glrlm_RunPercentage. (g) wavelet-HLL_glrlm_RunLengthNonUniformityNormalized. (h) wavelet-LLL_glcm_Idm

Discussion

In this retrospective study, we constructed a stable predictive model with one clinical feature (age) and seven radiomics features extracted from NECT to discriminate AVM-related hematomas. Favorable predictive performance was achieved, with a classification accuracy of approximately 93%. Since NECT is non-invasive, requires no contrast injection, and is time-saving, this technique provides a fast auxiliary approach to the etiological diagnosis of patients with acute intraparenchymal hemorrhage (IPH) in the emergency room.

From the perspective of treatment, early detection of AVM-H is important for at least three reasons. First, microsurgical elimination of the nidus may encounter unexpected difficulties. Unless the space-occupying hematoma requires immediate decompression, for acutely ruptured brain AVMs we generally wait until the hematoma has liquefied so as to have an operative space [18]. According to the study of Barone et al [19], resecting an AVM and its associated space-occupying hematoma in the acute phase had a much worse outcome than surgery in the subacute stage (52% vs 93%). Therefore, we should avoid misdiagnosing an AVM-H as HP-H and operating on it in the acute phase. Second, an AVM nidus with hematoma may be CTA-negative or equivocal and thus require further DSA examination [20,21,22]. Even on DSA, possibly due to hemodynamic remodeling and hematoma compression, some small AVM niduses may be occult and only appear on delayed DSA. A radiomics-based classification model therefore offers additional information for diagnosis. Third, for hematomas whose volume does not reach the criterion for open surgical evacuation, we should screen out AVM-H for embolization of the nidus to avoid rebleeding, instead of leaving them on conservative treatment like other hematoma types.

In clinical practice, radiologists can make an empirical differential diagnosis based on several factors, including the location of the hematoma, the age of the patient, and hematoma morphology such as an indented border. Although AVMs may have ventricular extension or form hematomas in the basal ganglia, the most frequently seen hematoma type in this area is HP-H [23]. CAA-H, in contrast, is generally located in the cortical-subcortical area, sparing the deep white matter and basal ganglia [11]. In addition, the intensity of AVM-associated hematomas tends to be more irregular and heterogeneous, presumably because of calcification and malformed vasculature embedded in the hematomas. By contrast, hypertensive hematomas are more likely to be uniform in shape, and CAA-related hematomas usually manifest irregular borders [24]. Apart from these image features, some clinical features differ statistically between patients with AVM-related hematomas and those with other types. According to a recent population-based study that compared outcomes after different hematoma types, patients with AVM were younger and more often male than patients without AVM [25]. However, such empirical diagnoses were imprecise, as shown by our comparison between human raters and the radiomics model. The location of large hematomas was ambiguous, patient ages overlapped, and the heterogeneity of hematomas was hard to discern by the naked eye. Moreover, the classification was influenced by the rater’s academic background. Interventionalists were inclined to diagnose HP-H and CAA-H as AVM-H, while neuroradiologists made more prudent diagnoses yet had a lower detection rate of AVM-H.

Radiomics feature-based models have succeeded in a variety of tumor-related predictions [11, 26,27,28,29,30] but have rarely been used in the analysis of vascular diseases. To the best of our knowledge, only two studies to date have applied radiomics to hemorrhagic and ischemic lesions. One study quantified the radiomics features of symptomatic plaques in the basilar artery, and the other investigated the texture heterogeneity of expanded hematomas [31, 32]. Our study aimed to use radiomics to quantify the image characteristics of IPH. According to our results, the feature glcm_Idm had smaller values on both the LoG and wavelet images for AVM-H. This coincided with our experience that AVM-H are more heterogeneous, since Idm (inverse difference moment) is a measure of the local homogeneity of an image and decreased Idm values denote heterogeneity. Moreover, the feature wavelet-HLL_glrlm_RunLengthNonUniformityNormalized measures the similarity of run lengths throughout the image. Hematomas caused by other etiologies had lower values, indicating more homogeneity among run lengths in the image. The feature LoG_glrlm_RunPercentage had larger values in AVM-H, indicating a coarseness of hematoma texture. In summary, the included radiomics features indicated that AVM-related hematomas tended to be larger in diameter, coarser in texture, and more heterogeneous in composition. Four cases of HP-related hematomas were misclassified by the RELF_Ada model (false positives). A possible explanation is that the heterogeneity of HP hematomas increased when the hematoma expanded, causing a mixture of blood that formed at different time points. Since hematoma expansion occurs most frequently in the acute phase (< 6 h) [33], we hypothesized that the false-positive predictions might involve HP-H that had already expanded in the acute phase. However, this hypothesis should be tested in future prospective studies with stratified time to NECT and an adequate number of false-positive predictions to draw a statistical conclusion.
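The relation between Idm and heterogeneity discussed above can be illustrated on toy images. The following sketch computes the inverse difference moment, Idm = Σ p(i, j) / (1 + (i − j)²), from a symmetric, normalized gray-level co-occurrence matrix for a single horizontal offset. It is a simplified illustration, not the PyRadiomics implementation, which also discretizes gray levels and combines several directions; the function name glcm_idm and the toy images are assumptions of this example.

```python
from collections import Counter


def glcm_idm(image, offset=(0, 1)):
    """Inverse difference moment of a symmetric, normalized GLCM for one offset."""
    d_row, d_col = offset
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[(i, j)] += 1
                counts[(j, i)] += 1  # count both directions: symmetric GLCM
    total = sum(counts.values())
    # Pairs of equal gray levels contribute fully; unequal pairs are down-weighted.
    return sum(n / total / (1 + (i - j) ** 2) for (i, j), n in counts.items())


# A perfectly uniform patch versus a maximally alternating one.
uniform = [[3] * 4 for _ in range(4)]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(glcm_idm(uniform))
print(glcm_idm(checkerboard))
```

The uniform patch reaches the maximum Idm of 1, whereas the alternating patch, in which every neighboring pair differs by one gray level, gives 0.5. Lower Idm therefore reflects a locally less homogeneous texture, in line with the smaller Idm values observed for AVM-H.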

Our retrospective study constructed a predictive model that can distinguish AVM-H from HP-H and CAA-H. However, two aspects require further improvement. First, our dataset was relatively small, and its baseline characteristics may not accord with those of population-based datasets. For example, patients with AVM-H in our study were much younger than those in a study that included data from a nationwide inpatient sample [25]. Second, our model lacks external validation. Future studies should be conducted on population-based datasets, so that an adequate number of false predictions can be obtained to further explore the relationship between nidus angioarchitecture and radiomics features. In addition, new radiomics features such as the Neighbouring Gray Tone Difference Matrix (NGTDM) and novel image filters should be tested, and convolutional neural network-based models could also be attempted for this classification task.

Conclusions

In this retrospective study, an Adaboost classifier trained with age and radiomics features extracted from non-contrast-enhanced CT demonstrated high accuracy in distinguishing AVM-H from IPH of other etiologies. Prospective studies are needed to further validate its classification ability.