Introduction

In 2010, an estimated 158 million individuals were at high risk of osteoporotic fracture worldwide, doubling by 2040 [1]. Despite the US Preventive Services Task Force recommendations, osteoporosis screening is underutilized [2]. Incident osteoporotic vertebral compression fractures increase the risk of future fractures, including hip fractures which have a 20–40% 1-year mortality [3, 4].

Given the morbidity and mortality associated with osteoporotic fractures and the high prevalence of disease, better detection could benefit many subjects. Fractures are often incidental, subtle, and under-reported, resulting in under-diagnosis and under-treatment [5]. Automated opportunistic screening of existing data could markedly improve the standard of care by improving screening rates and osteoporosis diagnosis at minimal added cost. Unlike prior predictive models that use covariates to predict future risk of fracture based on risk factors, or use CT data to predict BMD and osteopenia, or use CT data to screen for existing fracture, we seek to use radiographs to screen large populations for the presence of current or prior osteoporotic fracture [6,7,8]. In this body of work, we take fracture scores for each vertebra on a given radiograph from a deep learning model and subject data to predict whether there are one or more fractures on a radiograph.

Previously, we used a dataset from The Osteoporotic Fractures in Men (MrOS) study consisting of images of individual vertebra from each radiograph (vertebral patches) and a dichotomized Genant Semiquantitative method classification system to train a vertebral fracture detecting deep learning model [9, 10]. Final AUC-ROC performance was > 0.990 at the vertebral level. However, for clinical use as an automated screening tool, the decision to alert a clinician must be calculated at the subject (not vertebral) level. Careful consideration is needed to minimize false positives and the associated unnecessary workup, cost, effort, and stress.

One strategy for combining vertebral fracture scores into a binary subject-level output (zero versus one or more vertebral fractures) is to threshold the predicted fracture probabilities and then aggregate all vertebrae for a subject. The drawback of this approach is that the predictive accuracy established for a single vertebral model could be decreased in the subject-level model, because multiple outputs are combined into a summary prediction. For example, if we consider a ten-vertebrae model with an initial fracture model that controlled the false-positive rate at 1%, then using a rule of “any fractures” across the ten vertebrae would at best preserve the 1% error rate if outcomes are highly correlated across individual vertebra, or it could inflate the subject-level false-positive rate to nearly 10% if false-positive predictions across multiple vertebrae were independent (Supplementary Information Fig. SI1). Therefore, a subject-level analysis is needed to quantify and control the subject-level error rate to minimize false positives.

With the demographic data available for the MrOS subjects, separate vertebral fracture scores can be combined into subject-level fracture predictions. For clinical use, we focused on two objectives: (1) comparison of vertebral fracture score modeling strategies for subject-level fracture prediction and (2) incremental predictive value of readily available demographic covariates.

Methods

Data

The MrOS study, a prospective longitudinal cohort study, began in 2000 [11,12,13]. Six US academic medical centers contributed data under local IRB approval with written consent. A total of 5994 males, 65 and older, contributed clinical, laboratory, and imaging data at an initial (Visit 1) and follow-up visit (Visit 2), 4.5 years apart on average. Since there are roughly two visits for every subject, there are approximately double the number of vertebral patches compared to vertebral bodies. Covariates (age, ethnicity) were self-reported and collected at baseline. Summary data and demographics for the 669 subjects in the randomly selected test dataset (from MrOS) used in this study are shown in Table 1. The binarized form of classification used is Class 0 defined as Genant SQ class 0, Class 1 everything else.

Table 1 Basic demographic and summary information of the dataset

This secondary analysis uses individual vertebral fracture scores for the randomly selected 15% held-out test set from a deep learning (GoogLeNet) model described in a separate body of work [10, 14]. That vertebral model was trained and validated on the segmented thoracic and lumbar lateral radiographs from the other 85% of MrOS subjects with transfer learning from ImageNet [15], where training data was subsampled to enrich the ratio of class 0 to class 1 vertebrae to 2.5:1. Each radiograph was converted to a 16-bit image and, if necessary, flipped so the subject faced left. For each lateral spine radiograph, each vertebral body and a margin of surrounding pixels, often including the adjacent endplates, were extracted to create vertebral patches. The patches were resized to 224 × 224 pixels. Pixel intensities were normalized. The pre-processed vertebral patches were then used to train a GoogLeNet model using several techniques including image augmentation, transfer learning, Adam learning rate optimization, and early stopping. Hyper-parameters were tuned using a random search.

Vertebra above T4 and below L4 are excluded by MrOS, resulting in 13 usable levels. Ground truth for the presence of one or more fractures in a subject was defined by the maximum SQ annotation across these 13 vertebrae. Details on the Genant semiquantitative fracture classification system and the distribution of anatomic fractures across vertebral levels are in Fig. 1.

Fig. 1
figure 1

The Genant semiquantitative osteoporotic fracture classification system classifies osteoporotic vertebral compression fractures by the degree of height loss and the change in vertebral morphology with either loss of height anteriorly, centrally, or posteriorly. The degree of height loss is estimated visually and is complicated by the narrow range of height loss inherent to the different rows (20 to 25%, 25 to 40%, or > 40%). To focus on the clinically relevant endpoint of whether there is confidence that an osteoporotic fracture is present, this classification system is simplified into a binary classification system by grouping the moderate and severe deformities together (orange class 0: SQ 3, 4, 4.5, 5, 6, 7) and grouping the normal and mild deformity classes together (yellow class 1: SQ 0, 1, 2, 2.5). The frequency of different Genant class fractures (SQ 1, 2, 2.5, 3, 4, 5, 6, 7) is plotted by spinal anatomic level. A bimodal distribution of fractures can be seen centered around T12-L1 and T7. The division of anatomic levels into five distinct regions used in model development is demonstrated on the radiograph

Missing data

Due to technical or anatomic factors, 111 (0.73%) of the vertebrae did not have associated scores. To build subject-level models, missing vertebral fracture scores were imputed [16]. Multiple strategies were compared: random forests and either predictive mean matching (PMM) or Bayesian linear regression through multivariate imputation via chained equations (Supplementary Information Fig. SI2) [17,18,19].

Model classes

There are many ways to combine the individual vertebral scores into a subject-level prediction of fracture. To increase generalizability and interpretability, simpler methods with fewer inputs and less flexibility were attempted first.

We fit logistic regression (LR) and generalized additive models (GAMs) using selected inputs [20, 21]. That is, for each subject \(i\) and \(p\) inputs \({x}_{i1},\dots .,{x}_{ip}\), we fit models of the form:

$$\begin{array}{cc}g\left(P\left({y}_{i}=1|{x}_{i1},\dots ,{x}_{ip}\right)\right)={\beta }_{0}+{\beta }_{1}{x}_{i1}+\dots +{\beta }_{p}{x}_{ip}& \text{LR}\end{array}$$
$$\begin{array}{cc}g\left(P\left({y}_{i}=1|{x}_{i1},\dots ,{x}_{ip}\right)\right)={\beta }_{0}+{f}_{1}\left({x}_{i1}\right)+\dots +{f}_{p}\left({x}_{ip}\right)& \text{GAM}\end{array}$$

where \(P\left({y}_{i}=1|{x}_{i1},\dots ,{x}_{ip}\right)\) is the conditional probability of any vertebral fracture given the inputs, \(g\left(\bullet \right)\) is the logit function, and \({f}_{1},\dots ,{f}_{p}\) are the nonlinear functions estimated using scatterplot smoothers. LR is the most common model for binary outcomes. GAMs are a popular extension, where the assumption that the log-odds are linear is replaced by the more general form above [22, 23]. By relaxing the linearity constraint, GAMs increase model flexibility and capture nonlinearities in the covariate effects. We focused on regression models because of the low dimensionality of the predictor space, which avoids overfitting and facilitates explainability and ease of implementation.

Model inputs

For inputs, we compared performance using the individual vertebral scores or their order statistics (the top k fracture scores at the vertebral level). Order statistics, like the maximum, are nonlinear functions of the individual scores. Interaction terms for neighboring anatomic levels were also modeled, since fractures can be subtle and adjacent intermediate predictions should increase the confidence of a fracture as multiple osteoporotic compression fractures typically are adjacent. Fracture adjacency is explored in Fig. 1 and Supplementary Information Fig. SI3.

The effect on model performance by incorporation of demographic covariates was explored. When working with medical imaging data, basic subject demographics are embedded within the DICOM metadata of the file. Thus, subjects’ age and gender are readily available. MrOS subjects are exclusively male, so age but not gender was explored as a covariate for prediction. The incidence of osteoporotic fractures has been shown to vary by race [24, 25]. Race/ethnicity was extracted (GIERACE) and tested as a predictive covariate.

Model assessment

This dataset is 15% of the original MrOS dataset and comprises 1176 radiographs on 669 subjects, with 83 subjects having one or more moderate to severe fractures. To fairly report model performance and maximize the limited data available, fivefold cross-validation was performed [26]. The dataset contains two sets of radiograph images (lumbar and thoracic radiographs from two visits for a total of four radiographs per subject) and demographic covariates for each subject. These measurements are not independent across visits. To perform fivefold cross-validation, radiographs were split into five “folds” or non-overlapping subsets by subject (Supplementary Information Fig. SI4), so measurements between the distinct subsets remained independent. Folds were stratified to ensure (1) class balance within the folds (similar rates of vertebral fracture in each subset) and (2) test sets represent the population as closely as possible. Models were then fit five times, holding out each subset (or fold) in turn as test data. Performance metrics were collected, and their means and standard errors were reported.

Several metrics were computed, but AUC-ROC was used for summary comparison. AUC-ROC was chosen, as it more completely describes a classification model’s performance and tradeoff between true-positive and false-positive rates, and is threshold- and scale-invariant. The AUC-ROC can be optimistic or misleading in settings with large class imbalances such as the one considered here, where about 12% of subjects had moderate to severe fractures. Therefore, both ROC and PR curves were calculated for models of interest. The PR curve, which considers the tradeoff between sensitivity and positive predictive value (PPV), is useful for “rare events,” as it can be more challenging to balance PPV and sensitivity appropriate for a given application of the model [27, 28]. Simply put, large differences in false positives result in more noticeable changes to the PR than the ROC curve.

Implementation

Imputation through random forests was generated with R package “missForest,” PMM, or Bayesian linear regression used “mice,” and missingness was simulated as missing completely at random using code from “R-miss-tastic” [X, X, X]. LR models were fit using the R “stats” package [X]. GAMs were built using the “gam” package [X]. Smoothing splines were fit with default parameters. ROC and PR curves were generated with the “precrec” package [X]. The code for this project is available as an R package on github.

Results

Imputation

The optimal imputation method was determined by evaluating the mean squared error on the complete data with simulated missingness (Supplementary Information Fig. SI2). PMM averaged over five imputations was selected and used to impute the small number of missing vertebral fracture scores.

Model performance

Table 2 provides an overview of our results for subject fracture prediction models using fivefold cross-validation, with multiple classification metrics listed. We report the mean and standard errors of the AUC-ROC, AUC-PR, precision or PPV, sensitivity (recall/true-positive rate), specificity, and model accuracy. In addition, the same AUC-ROC for all models across each of the fivefolds is shown to provide full transparency and capture test performance. Figures 2, 3, and 4 afford additional insights into model performance by comparing ROC and PR curves between models of interest.

Table 2 Model performance for logistic regression and GAMs across all inputs
Fig. 2
figure 2

Average ROC and PR curves for LR models using as input fracture scores from all vertebrae, the maximum vertebral fracture score, all vertebral fracture scores ranked, and the maximum of each vertebral region (as described in Fig. 1). Overall, we can see the best performance was achieved with the maximum vertebral score as input, which had better false-positive rates (FPR = 1-specificity) and precision across all levels of sensitivity

Fig. 3
figure 3

The average ROC and PR curves for the top models of both classes, LR with maximum vertebral body score and GAM with the top three maximum vertebral body scores plus age as inputs. The ROC curves are nearly indistinguishable, but slight performance differences are more noticeable on the PR curve. While the GAM was the top model in ROC-AUC, the LR with only the maximum score is worth considering for its simplicity and interpretability

Fig. 4
figure 4

ROC and PR curves for the fivefolds of the top model, a GAM using the top three maximum vertebral body fracture scores and age as input. We used fivefold cross-validation to better assess model performance, as we see here predictive performance is relatively consistent across the folds for the top model

Model inputs determined performance more than model class (LR or GAM). The increased complexity of the GAMs allowed them to match the AUCs of the top LR models, but this flexibility resulted in lower AUCs for the less informative inputs (Table 2).

The baseline model using all vertebral fracture scores as input achieved an average AUC-ROC of 0.908 with LR and 0.816 with a GAM and AUC-PR of 0.672 and 0.615, respectively. Summary statistics of those same inputs proved even more informative; using only the maximum vertebral fracture score increased the AUC-ROC to 0.966 (LR) and 0.965 (GAM) and AUC-PR to 0.859 and 0.857, respectively. Adding the next one to two highest vertebral fracture scores resulted in comparably high-performing GAMs, but adding additional order statistics beyond k = 3 showed diminishing returns. In both model classes, using all ranked vertebral body fracture predictions outperformed the baseline, but did not match models using fewer order statistics as seen in Table 2 and Fig. 2. The maximum fracture score of each vertebral region (upper thoracic, mid thoracic, lower thoracic, thoraco-lumbar junction, and lower lumbar) was also a more informative input than the baseline of all vertebral predictions, but, in LR models, was less informative than order statistics that ignored vertebral location. Specifically including vertebral location information (in the form of an indicator variable for the anatomic level of the maximum vertebral body fracture score) decreased model performance, dropping the AUC-ROC from 0.966 to 0.923. This demonstrated that the rank of the vertebral fracture predictions may be more important than the specific vertebral location.

Although the models based on all individual vertebral fracture scores were not the highest performing, incorporating interaction terms for neighboring vertebral pairs into our linear model showed minor improvements above the baseline input of all vertebra, increasing the AUC-ROC from 0.908 to 0.909 and AUC-PR from 0.672 to 0.695 (curves attached in Supplementary Information Fig. SI5 and SI6). Likewise, models based on top maximum vertebral scores, including interactions between maximum scores, produced improvement in AUC-PR but not in AUC-ROC. In the LR model using the top three maximum vertebral fracture scores including interaction terms between maxima, AUC-ROC dropped slightly from 0.960 to 0.958 but brought up AUC-PR from 0.792 to 0.828. The improvement can be most easily seen on the PR curve in Supplementary Information Fig. SI6.

Demographic covariates (age and race/ethnicity) were incorporated into multiple models. In both model classes with all vertebrae as input or in GAMs with multiple maximum vertebral body scores, adding age improved AUC-ROCs slightly but also decreased AUC-PR. While the top-performing model included age as an input, the additional benefit of including age was minimal. Incorporating race/ethnicity decreased performance in both model classes. As the dataset is composed exclusively of males over the age of 65, and mostly white, these covariates may be more informative in other populations.

Top models

The results demonstrate high performance using the top-ranked predicted vertebral body scores, with the maximum vertebral body score producing one of the best models of either class in terms of the largest average area under the receiver operating characteristic or precision-recall curves (AUC-ROC or AUC-PR). The highest AUC-ROC overall was a GAM with the top three maximum predictions and age as inputs, and the performance of the GAMs using the top one to three maximum vertebral bodies with or without age all achieved similar predictive ability. For both model classes, the ranked vertebral body predictions or the maximum prediction by region as inputs outperformed all vertebrae, but were notably lower than the maximum.

The highest average AUC-ROC (0.968) was achieved by the GAM with the top three maximum vertebral fracture scores and age as inputs (Fig. 3). The experimental results for this model across all fivefolds are shown in Fig. 4 and demonstrated similar performance across all test sets. We saw almost identical AUC-ROCs in the GAMs with the highest one to three maximum vertebral fracture scores, either with or without age as an input, and LR with the single maximum vertebra. This is the best-case scenario discussed above, where misclassification error rates are controlled at the subject-level models to nearly match the performance of the original vertebral body fracture prediction model. ROC and especially PR curves are useful here to evaluate or select the best models, as many of the top models achieved nearly identical ROCs but have distinguishable PR curves. This is most notable in the models using top-ranked vertebral fracture scores, where the addition of interaction terms in the LR models showed improved PR curves while age in the GAMs did not. In clinical settings, the ROC and PR curves can also help to determine classification thresholds to achieve desirable error rates.

Discussion

Compared to other medical AI models, we achieved a high predictive performance (AUC-ROC) of 0.97 at the subject level compared to the vertebral level of 0.99 for the use case of osteoporotic vertebral fracture screening. While the upstream vertebral model is highly accurate, baseline subject models simply combining those vertebral scores proved less so (AUC-ROC 0.91). a problem not solved with increased model complexity (AUC-ROC 0.82). The decreased performance of the baseline subject-level classifier compared to the vertebral level demonstrates the concern described in the introduction and a motivation for this work. Simply combining multiple model outputs, even from highly predictive models, propagates error.

Key experimental takeaways are as follows: (1) Using maximal vertebral fracture scores produced superior predictive performance with good subject-level error control, (2) additional performance gains in the GAMs were provided by using the top three vertebral fracture scores, although (3) simple LR using only maximum score performed comparably, and (4) incorporating age and ethnicity had minimal impact.

The improved performance of the maximal vertebral fracture scores compared to using all vertebrae is unsurprising, as a practitioner diagnosing fracture in an individual subject would weigh more heavily the information gained from the most abnormal vertebra for that case. This also simplifies the application and development of the model, as imputation for missing vertebra can be avoided.

Age may have a role in predicting fracture, as experience and data would confirm, but the relationship may be nonlinear and was not particularly strong here after accounting for the maximum vertebral score. The predictive utility of demographic covariates, however, was not adequately evaluated in this study because of the limited diversity of the MrOS dataset (age > 65, male, mostly white). Additional MrOS covariates were not used in model development because clinical data has a paucity of readily available, reliable, high-quality covariates.

Strengths and limitations

This study has additional limitations. The use of multicenter data from a prior clinical trial may not generalize well due to differences in clinical and research imaging techniques, or in acquired images due to new modalities or imaging hardware. Clinical acquisitions are also performed in a variety of settings including inpatient and emergency departments where patients suffering acute illness are less able to cooperate with instructions and there can be superimposed instruments and objects, which were rare in MrOS data. The lack of publicly annotated spinal radiograph datasets limited the scope of this work, but future work could further evaluate performance and demographic predictors on additional, external datasets.

This work targeted moderate to severe fractures, which means that mild fractures may be missed. However, it should be noted that there is often poor clinical agreement on mild fractures and the severity of the fracture does not necessarily correlate with bone mineral density.

We focused on AUC-ROC for model evaluation, but recognize that model selection and thresholding are part of decision-making, which is outside the context of this paper and should incorporate specific population and domain information. For example, partial AUCs may be more clinically useful for model selection by thresholding false-positive rates, and the simplicity and ease of interpretability of LR may be worth minor performance decreases. In the prior vertebral fracture model, metrics were calculated with thresholds to balance PPV and specificity. This is appropriate in a setting where sample prevalence resembles the population and the focus is controlling the false discovery rate. Using the same strategy, the subject model here resulted in 0.624 sensitivity, 0.994 specificity, 0.94 PPV, and 0.955 NPV. Creating downstream models that rely on outputs from other models runs the risk of propagating errors. In the worst case, they may be additive when aggregating multiple sources of error. That is, even with an extremely accurate vertebral body fracture prediction model as input, the subject-level models may not result in comparable performance. Additionally, bias may be introduced upstream, for example subsampling to enrich the ratio of fracture in the training procedure for vertebral fracture scores, which could affect generalizability. To quantify performance most accurately, we used a 15% (unenriched) held-out test dataset, which by necessity limits the data size. Extensions of this work could quantify the subject model improvement with better vertebral body models. These methods of subject-level modeling can also be extended, incorporating additional structured or unstructured data sources like electronic medical records or using outputs from multiple models as a form of ensembling.

This project’s fundamental approach generalizes well in various healthcare AI settings. While many published projects perform a task on a complex piece of data (entire radiograph, entire CT scan, etc.), we show that by breaking the task into smaller units (vertebral patch classification) and achieving good performance on each subcomponent, you can further improve performance by thoughtfully combining predictions for a clinical task. By taking the maximum predicted value, we minimize error from our individual predictions and also facilitate explainable AI results by knowing the region of the radiograph responsible for the prediction of fracture. Supplementary Fig. SI7 shows how the output from this work could be displayed in a manner to facilitate easy interpretation by providers. Future work tying all components together by incorporating an external dataset and decision-making can further quantify pipeline performance, as well as provide insight into model failure cases.

Deep learning in medical imaging is in a nascent state. As the complexity of models and multi-model pipelines grows to solve more difficult image analysis problems, there needs to be nuanced manipulation and a combination of model outputs and covariates to reach useful clinical endpoints. The task of classifying spinal radiographs for osteoporotic compression fracture is challenging even for humans to attempt in a single classification step, and the performance may not be good enough for screening large populations [29]. With thoughtful incorporation of model inputs and covariates, careful analysis, and some clinical insight, error rates for individual steps can be mitigated and overall performance improved, as demonstrated here.

Conclusion

Using the outputs from a deep learning fracture classification of individual vertebral bodies from radiographs and basic demographic data, multiple methods for modeling patient-level outcomes were tested to find the most performant mechanism while keeping errors small. The best strategy demonstrates high performance by classifying patients with fractures. This research draws attention to one of the challenges and mitigation strategies for future clinical AI, where researchers wanting to improve the performance of predictive models break complex predictions into smaller tasks and downstream models must minimize error propagation while using smaller portions of reserved testing data.