Introduction

Vertebral compression fractures can have pathologic, traumatic, or atraumatic causes. The latter may occur after low stress due to bone mineralization loss leading to reduced mechanical bone strength [1]. The most common underlying causes are osteopenia and osteoporosis [2], two systemic skeletal disorders that especially affect the elderly and chronically ill. Osteopenia and osteoporosis are characterized by the loss of bone tissue, skeletal fragility, and microarchitectural deterioration [3].

Osteoporotic vertebral compression fractures affect many patients worldwide, entailing significant morbidity and mortality: In 2000, an estimated 1.4 million vertebral fractures occurred globally, approximately 214,000 of them in the USA in patients aged 50 or older [4]. The lifetime risk of a vertebral fracture at age 50 is 15.6% for women and 5% for men [5].

In general, the diagnosis of vertebral compression fractures is based on X-ray examinations. However, a substantial number of cases may be overlooked [6]. CT or MR imaging may be more appropriate to detect subtle cases.

In clinical practice, fracture risk is usually determined by dual-energy X-ray absorptiometry (DXA) [1]. The trabecular bone score, a gray-level textural metric, can be extracted from DXA images and improves fracture risk assessment [7]. Several studies demonstrate that automated bone mineral densitometry based on Hounsfield units (HU) in clinical CT images is feasible [8], and that it correlates well with DXA measurements [9]. Bone microstructure can be assessed ex vivo with microcomputed tomography and in vivo with high-resolution peripheral quantitative computed tomography (HR-pQCT) at comparable resolution [10, 11]. The combination of HR-pQCT and finite element analysis (FEA) determines stresses in human bones, permits highly accurate estimation of individual fracture risk, and predicts fracture sites [12]. Recent studies implemented FEA in clinical CT scans [13]. However, these techniques require dedicated hardware and software, which may hamper their application in clinical practice. Texture analysis (TA), on the other hand, is an objective and quantitative method to analyze the distribution and relationship of pixel or voxel gray levels in an image or volume [14], and it can be applied retroactively to standard CT scans. Feasibility of TA on bone structure has been demonstrated for radiographs [7, 15,16,17] and CT scans [18, 19].

We hypothesize that TA and machine learning (ML) allow prediction of vertebral insufficiency fractures in standard CT. The goal of this retrospective case-control study is to evaluate the diagnostic performance of bone TA combined with ML algorithms in standard CT scans to identify patients with vertebrae at risk for insufficiency fractures.

Materials and methods

Study approaches

In a paired control study (approach A), vertebrae from patients developing vertebral fractures (cases) were compared to vertebrae of matched controls with normal bone density. In a within-subject study (approach B), we investigated whether it is possible to predict if a given vertebra will fracture. In both approaches, vertebrae were classified using ROC analysis of HU measurements and supervised ML techniques. Figure 1 depicts the two separate approaches, A and B.

Fig. 1

Schematic of the two classification approaches, illustrated with an example. CT scan of a matched control (a.). Primary CT scan of a 59-year-old female subject (b.). Secondary CT scan of the same subject 4 months later with a newly occurred insufficiency fracture of vertebra L4 (c.). Cropped images of vertebra L2 (d1.) and L4 (d2.) of the matched control. Cropped images of vertebra L2 (e.), which remains intact in the secondary scan, and of vertebra L4 (f.), which is fractured in the secondary scan. Classification approach A is a paired control study comparing the study cohort (cases) with an external cohort presenting a normal BMD with DXA (controls). Approach B is a within-subject vertebra comparison between unstable vertebrae and intact reference vertebrae of the study cohort

Study population

The study received institutional review board and local ethics committee approval. From the institutional PACS, we identified 30,931 patients above 45 years of age who underwent a clinical CT scan covering at least the thoracic or lumbar spine between January 2006 and December 2013. Of these, 670 patients had received at least two CT scans within a year and a third scan at least 5 months after the second scan. The third scan was used to verify that the stable vertebra remained intact over a longer period. Eventually, 58 patients remained after a review for newly occurred osteoporotic vertebral compression fractures of the thoracic and/or lumbar spine using established criteria [20, 21] and after exclusion of patients with traumatic fractures or metastases of the spine (Fig. 2). These patients had two consecutive scans, showing intact vertebrae in the first scan that either fractured (“unstable” vertebra n = 60) or remained intact (“stable” vertebra n = 60).

Fig. 2

Flowchart of the patient selection process

We divided the spine into the following regions: thoracic spine (Th1-Th10), thoracic-lumbar junction (TLJ, Th11-L1), and lumbar spine (L2-L5). Stable and unstable vertebrae were selected from the same region. As a control set, images of 58 patients from a previous study, all presenting a normal bone mineral density (BMD) with DXA, were matched by age, sex, and region of the spine [18]. The mean age was 70 ± 9 years (range 48–90 years), including 26 women and 34 men in both the patient and the control groups. The mean age was 69 ± 10 years (range 48–89 years) in the female subgroup and 71 ± 9 years (range 53–90 years) in the male subgroup. The age difference between these subgroups was not significant (p = 0.519). A flowchart of the patient selection is shown in Fig. 2.

CT data and post processing

Sagittal image stacks covering the spine were retrieved from the PACS and saved in uncompressed DICOM format. The images were acquired with different CT scanners: the SOMATOM-Force, Definition, Definition Flash, Definition AS and Sensation CT (all from Siemens). The collimation of the scanners ranged from 0.6 to 1.2 mm. The image section thickness of the sagittal images was 2.0 mm, and kilovoltage peak (kVp) ranged from 90 to 140. All images had been reconstructed using an edge-enhancing bone kernel and were rescaled to the coarsest in-plane resolution of 0.5 mm. No low-dose protocols were included.

Image classification

One radiologist (UJM, with 2 years of experience in skeletal radiology) manually segmented trabecular bone of stable and unstable vertebrae on single, mid-sagittal images of the primary scans, using freehand-drawn regions of interest (ROIs). TA was performed using open-source software (MaZda, version 4.6) [22] with the same technique as for the control subjects [18]. The TA software calculated 305 features from 6 different statistical image descriptors (Supplements Table 1) for each ROI. Since ROI characteristics (e.g., location, size, shape) can influence texture features, only reproducible features, i.e., features with excellent intra- and inter-reader agreement (intraclass correlation coefficient (ICC) ≥ 0.81) as defined in a previous study [18], were included. Mean CT HU values were obtained from the identical ROIs that were used for TA feature extraction.

Human readout

Two radiologists (RG, 15 years and ASB, 3 years of experience in skeletal radiology) visually rated vertebral trabecular bone texture with respect to number, length, and thickness of bone trabeculae, using a 5-point Likert-like scale (1, age-appropriate; 2, rather age-appropriate; 3, unsure; 4, rather age-inappropriate; 5, age-inappropriate bone texture) in approaches A and B, based on reported correlations of structural trabecular bone appearance and fracture risk [23,24,25]. In these vertebra-based analyses, the cropped vertebral images showing only the spongiosa were presented with identical windowing (window width 1720 HU, level 535 HU) in random order as mid-sagittal reformations on a standard reporting workstation in two separate readout sessions for approaches A and B. Readers were allowed to change the window settings and take ROI measurements. Figure 1d–f shows examples of the cropped images.

Statistical analysis

The statistical analysis was performed in R version 3.4.2 (R Foundation for Statistical Computing). Continuous variables were described by mean and standard deviation or by median, interquartile range, and range, where appropriate. The chi-square test was used to compare ordinal and nominal protocol parameters. The Mann-Whitney U test was used to investigate the influence of protocol settings on TA features. Tenfold cross validation with stratified sampling was used with 1/3 of the data as test set and 2/3 as training set. The features and folds were consistent across all classifiers. Pre-processing consisted of data standardization (centering and scaling) and removal of redundant features (Pearson correlation coefficient R ≥ 0.80). Selected features were compared using the Wilcoxon test, and co-correlation was assessed with Pearson correlation. For classification, the following ML classifiers from the caret package version 6.0-77 [26] were used: multi-layer perceptron (MLP) with 3 hidden layers, artificial neural network (ANN) with a single hidden layer, random forest (RF), support vector machine (SVM) with linear kernel, and naïve Bayesian classifier. Feature importance was calculated for ANN using the method of Gevrey et al. [27], for RF using permutation, and for SVM using ROC analysis. ROC analysis was performed for ML classifiers and the human readers for approaches A and B. Diagnostic accuracy was expressed as the area under the ROC curve (AUC). The nonparametric test by DeLong et al. was used to compare AUCs [28]. P < 0.05 was considered indicative of significant differences, with Bonferroni correction for multiple comparisons where appropriate. All tests were two-tailed. Inter-reader agreement was assessed using the concordance correlation coefficient (CCC) [29].
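
The pre-processing steps named above (centering, scaling, and removal of features with Pearson R ≥ 0.80) can be illustrated with a minimal Python sketch. This is not the R/caret code used in the study: the function names standardize and drop_redundant, the greedy keep-first rule, the use of the absolute correlation, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def standardize(X):
    """Center each feature column to mean 0 and scale it to unit sample SD."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # assumption: leave constant columns unscaled
    return (X - mu) / sd

def drop_redundant(X, threshold=0.80):
    """Greedy filter (assumed rule): keep a column only if its absolute
    Pearson correlation with every already kept column is below threshold."""
    r = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(r[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

# Toy data (hypothetical): column 1 is a linear copy of column 0,
# column 2 alternates in sign and is only weakly correlated with column 0.
a = np.arange(10, dtype=float)
b = 2.0 * a + 1.0
c = np.array([1.0, -1.0] * 5)
X = np.column_stack([a, b, c])

Z = standardize(X)
kept = drop_redundant(Z)
print(kept)  # indices of the columns that survive the filter
```

The greedy rule depends on column order; other strategies, such as dropping the member of a correlated pair with the larger mean correlation to all other features, would be equally compatible with the description in the text.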

Results

Study population

Sixty stable and 60 unstable vertebrae of 58 patients were included. Twenty-two vertebrae of the thoracic spine, 24 of the TLJ, and 14 of the lumbar spine region were affected. Reasons for referral for the primary CT scan were oncologic diseases (n = 23), status after vascular repair (n = 22), acute gastrointestinal problems (n = 7), non-oncologic diseases of the lung (n = 5), and suspected fracture (n = 3). The mean time difference was 237 ± 117.5 days between the primary and secondary scan and 548 ± 386 days between the secondary and third scan.

CT data and post processing

We found no significant difference in contrast agent administration between the case and control cohorts (p = 0.432). However, tube voltage differed significantly between the cohorts, with higher tube voltage in the case cohort (p = 0.019). Figure 3 depicts the kVp used in the study and control cohorts. Furthermore, we found no significant influence of the tube voltage on the selected features within the study cohort (U ranged from 1968 to 3419, p ranged from 0.184 to 1) or the control cohort (U ranged from 5269 to 6667, p ranged from 0.120 to 1).

Fig. 3

Barplots show the distribution of applied kilovoltage peak (kVp) in the case and control cohort

Image classification

After removal of 88/305 (28.9%) features with poor reproducibility and 188/217 (86.6%) redundant features, 29 features were considered for classification in both approaches, A and B. In approach A, 16/29 (55.2%) texture features showed a significant difference between cases and controls after Bonferroni correction. There were no significant differences between selected texture features in approach B after correction. Table 1 depicts means and differences of all selected features.

Table 1 Differences of means of selected texture features for approaches A and B

In approach A, ROC analysis using mean HU values for classification yielded an AUC of 0.83 (95% confidence interval [CI], 0.77–0.88). All ML classifiers yielded higher accuracy (AUC ranging from 0.88 to 0.96, Table 2), and the highest test accuracy was achieved using SVM with an AUC of 0.97 (CI, 0.96–0.98). There was no significant difference between the AUCs of SVM and RF (p = 0.07). The AUC of SVM was significantly higher compared to MLP, ANN, and naïve Bayesian classifier (p = 0.029, p < 0.001, and p < 0.001, respectively). The AUC using SVM and texture features was significantly higher than that of ROC analysis using mean HU values (p < 0.001). ROC curves of ROC analysis and ML classification are shown in Fig. 4. Figure 5 demonstrates estimated feature importance for the ANN, RF, and SVM classifiers. Of the combined top 10 important features of all ML classifiers, 3/17 (17.6%) belong to the image histogram (IH)-, 2/17 (11.8%) to the gray-level run-length matrix (GLRLM)-, 4/17 (23.5%) to the gray-level co-occurrence matrix (GLCM)-, 3/17 (17.6%) to the autoregressive model (AR)-, and 5/17 (29.4%) to the wavelet transformation-derived features. The combined top 10 important features showed low co-correlations, as demonstrated in Fig. 6.

Table 2 Classification and human readout results
Fig. 4

Upper row shows receiver operating characteristic performance for classification in case/control (a) and stable/unstable (b) vertebrae using Hounsfield unit measurements alone. Lower row shows receiver operating characteristic performance for classification in case/control (c) and stable/unstable (d) vertebrae using 29 texture analysis features and 5 machine learning classifiers

Fig. 5

Barplots show estimated texture feature importance for classification in case/control vertebrae for random forest classifier (a), artificial neural network classifier (b), and support vector machine classifier (c)

Fig. 6

Heatmap shows the combined top 10 important features of all classifiers for controls (a) and cases (b), demonstrating low co-correlation between features

In approach B, ROC analysis using mean HU values for classification yielded an AUC of 0.52 (CI, 0.42–0.63). All ML classifiers showed low accuracy (AUC ranging from 0.50 to 0.64), and the highest test accuracy was achieved using SVM with an AUC of 0.64 (CI, 0.61–0.67). SVM classified stable and unstable vertebrae significantly better than mean HU values (p = 0.029).
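
For orientation, the AUC can be read as the probability that a randomly chosen positive (e.g., an unstable vertebra) receives a higher classifier score than a randomly chosen negative, so that 0.5 corresponds to chance. The following minimal Python sketch computes this quantity directly from pairwise comparisons. The function name auc and the toy scores are hypothetical and are not taken from the study data.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive scores
    higher; ties count as half a win (Mann-Whitney interpretation of the AUC)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores for three positives and three negatives.
separated = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])  # 8 of 9 pairs ordered correctly
uninformative = auc([1, 1], [1, 1])                # all ties, i.e. chance level
print(separated, uninformative)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a data set of this size; rank-based formulas give the same value more efficiently.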

Human readout

Diagnostic accuracy in approach A was low for both readers, with an AUC of 0.57 (CI, 0.50–0.63) for reader 1 and 0.48 (CI, 0.41–0.55) for reader 2. Reader 1 classified case versus control images significantly better than reader 2 (p = 0.03), but only slightly better than chance. Both readers showed random within-subject classification of stable versus unstable vertebrae (reader 1 AUC 0.53, CI 0.44–0.62 and reader 2 AUC 0.49, CI 0.39–0.59), with no significant difference between the readers (p = 0.569). Inter-reader agreement was low for both approaches A (CCC 0.12, CI 0.04–0.21) and B (CCC 0.15, CI 0.05–0.25). Table 2 summarizes test accuracy (AUC) of automated and human reader image classification.
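
To make the agreement measure concrete, the following Python sketch computes Lin's concordance correlation coefficient for two raters. It is an illustration only: the function name ccc and the toy ratings are hypothetical, and the use of population (1/n) variances is an assumption that may differ from the implementation used in the study.

```python
def ccc(x, y):
    """Lin's concordance correlation coefficient:
    2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    computed here with population (1/n) moments."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Hypothetical Likert ratings from two readers.
identical = ccc([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])  # perfect agreement
shifted = ccc([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])    # same ranking, constant offset
print(identical, shifted)
```

Unlike the Pearson coefficient, the CCC penalizes a systematic offset between readers: the shifted example is perfectly correlated but does not reach a CCC of 1.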

Discussion

In this study, we investigated the use of TA combined with ML in clinical CT scans for the differentiation of patients as well as vertebrae at risk for fracture. We found that an SVM classifier utilizing 29 texture features yielded a high AUC of 0.97 for identifying patients at risk for insufficiency fractures. Differentiation using mean CT density measurements alone yielded an AUC of only 0.83 (p < 0.001). Given the same data, an experienced radiologist reached a performance only slightly better than chance (AUC = 0.57). However, neither TA/ML nor the radiologists could reliably distinguish between the single vertebral bones at risk for fracture and the neighboring segments in the same individuals.

Increased computational power and the successful development of new algorithms in the last decades have led to promising approaches in various fields of radiology [14, 30] and TA has been shown to be efficient in the differentiation of osteoporotic and healthy subjects in HR-pQCT data [31].

In our study, the combination of ML and TA revealed several important TA features when discriminating between healthy and diseased bones. The mean signal intensity emerged as one of the most important factors in our analyses. This is consistent with the literature, which reports that bone mineralization is an important factor for bone strength [1, 32]. However, as previously suggested [15, 33, 34], the increased performance of models including TA features, as well as the lack of correlation with the mean signal, supports the notion that these features are independent surrogate markers for trabecular microarchitecture.

In particular, complex TA features that showed a high importance in our classification (see Fig. 5) could be linked to known pathomorphological changes in the osteoporotic bone: Some GLRLM features (e.g., dgr45ShrtREmp, Table 1) that are defined over information of consecutive pixels of the same value in a given direction are negatively correlated with trabecular bone volume measured by histomorphometric methods, which is increased in osteoporotic individuals [34, 35]. Moreover, some wavelet features (e.g., WavEnHLs_2, Table 1) played an important role in our classification. Because of their filter function, they can be considered as a breakdown of an image into a set of spatially oriented frequency channels useful to detect horizontal and vertical lines in images as well as crossings and corners. Wavelet transformation-derived texture features were previously reported to enhance the robustness and accuracy in trabecular bone classification on radiographs [17]. Moreover, they may aid detection of trabecular bone lesions in CT [36]. The GLCM feature entropy (S05SumEntrp, Table 1) that is defined over the distribution of co-occurring pixel values at a given offset was also of great importance in our classification. Entropy has been suggested as a marker of impaired skeletal integrity [19]. Further TA features, such as skewness (a measure of the asymmetry of the image histogram) or kurtosis (a measure of the peakedness of the image histogram), did not show a high importance in our classification.
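
As an illustration of how a co-occurrence-based entropy feature is built, the following Python sketch constructs a gray-level co-occurrence matrix for a given pixel offset and derives the sum entropy from it. It is not the MaZda implementation: the function names glcm and sum_entropy, the symmetric counting, the base-2 logarithm, and the tiny two-level test image are assumptions made for illustration.

```python
import numpy as np

def glcm(img, dr, dc, levels):
    """Symmetric, normalized co-occurrence matrix for pixel pairs that are
    separated by the offset (dr, dc). Assumes at least one valid pair exists."""
    P = np.zeros((levels, levels))
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = img[r, c], img[r2, c2]
                P[i, j] += 1  # count the pair in both directions
                P[j, i] += 1
    return P / P.sum()

def sum_entropy(P):
    """Entropy (in bits, an assumed log base) of the distribution of the
    gray-level sum i + j over all co-occurring pixel pairs."""
    levels = P.shape[0]
    p_sum = np.zeros(2 * levels - 1)
    for i in range(levels):
        for j in range(levels):
            p_sum[i + j] += P[i, j]
    nz = p_sum[p_sum > 0]  # skip empty bins, since 0 * log(0) is taken as 0
    return float(-(nz * np.log2(nz)).sum())

# Hypothetical 2 x 2 image with two gray levels, horizontal offset of one pixel.
img = np.array([[0, 0],
                [0, 1]])
P = glcm(img, 0, 1, levels=2)
print(P)
print(sum_entropy(P))
```

A homogeneous region yields a sum entropy of zero, whereas a region with a more varied mix of neighboring gray levels yields higher values.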

In our study cohort, TA and ML did not allow reliable discrimination between a vertebra that will fracture within the following 4–12 months and a vertebra that will remain intact in the same patient. Several factors could have led to these non-significant results: For example, the differences might be too small for the sample size of this study. Alternatively, external factors, such as static or kinetic effects that have a determining influence on the site of fracture [37], may not be contained in the data. Furthermore, the standard feature extraction used in this paper could have limited discrimination of small differences. A recent study using deep features extracted from convolutional neural networks, for example, revealed improved results in osteoporosis classification on X-ray images [38].

Our results from human reader analyses suggest that healthy and osteoporotic trabecular bone show few visual differences, leading to a low accuracy in the identification of osteoporotic bone as well as limited inter-reader agreement. Similar results were reported for radiographs [39]. However, in a real clinical setting, the accuracy of bone quality estimation is usually higher since visual signs of osteoporosis, such as vertebral cortical bone thinning and fractures of adjacent vertebrae in a larger spine region, can be considered.

In this retrospective study, we consider the male predominance in the case group (26 women and 34 men) to be a selection bias since women have a higher prevalence and incidence of osteoporotic vertebral fractures. Male predominance could partially be explained by the inclusion criterion of 2 CT scans within a year, which led to a high number of patients with vascular repair (22/58).

There are several other limitations to our study. First, this was a retrospective case-control study, using a relatively small data set of 240 vertebrae of patients from a single academic institution. Second, the CT images included in our study were acquired using various CT imaging protocols and different CT scanners, which may introduce biases to the texture features. Third, our analysis did not include other factors like cortical thickness or vertebra size, which are reported to be important in assessing bone quality. Fourth, TA feature extraction in our study was limited to the features available in the MaZda software, and more recent methods for feature extraction in bone TA were not included [40]. Lastly, pre-built machine learning models in our study were not thoroughly and individually “fine-tuned” with, e.g., a grid search, since this was not necessary for our proof-of-principle study.

In conclusion, TA combined with ML allows identification of patients who will suffer from future vertebral insufficiency fractures in standard CT scans with high accuracy. However, identification of single vertebrae at risk remains challenging.