Introduction

Osteoporotic vertebral fractures (OVFs) are a major health concern in the elderly population [1]: they degrade quality of life and reduce life expectancy [2, 3]. Although most OVFs heal within a few months [4], some result in nonunion even after conservative treatment, including osteoporosis medication, bed rest, and braces [4]. Nonunion can lead to severe back pain and neurological deficits [5, 6], often requiring surgical intervention [7]. Because surgical intervention is more effective when performed early [8], early prediction of nonunion is essential for effective treatment. Several risk factors for nonunion have been reported in the literature [5, 9, 10, 11, 12, 13]; in particular, magnetic resonance imaging (MRI) has been extensively used for prediction. However, conventional methods for nonunion prediction still leave room for improvement.

Recently, machine learning (ML) has been widely incorporated in prediction models [14, 15]. The most widely used ML methods are supervised learning methods [14], which derive general hypotheses from externally supplied instances and use them to output predictions for future instances [15]. Accurate disease outcome prediction is one of the most interesting and challenging tasks for physicians. ML-based techniques can discover patterns, and the relationships between them, by analyzing complex datasets [16]. Consequently, ML methods have become popular tools among medical researchers. Available ML methods include general linear regression models, decision trees, random forests (RFs), and extreme gradient boosting (XGBoost) [17, 18]. In this study, we hypothesized that these models would improve on the prediction accuracy of conventional nonunion prediction methods. The purpose of this study was to create an ML-based nonunion prediction model.

Materials and methods

Study design

We used datasets from two previously conducted multicenter cohort studies [4, 19] (Fig. 1). The first cohort study was conducted to investigate the risk factors for poor prognosis in OVF patients at 25 institutions between 2005 and 2009. The second cohort study was conducted to investigate changes in MRI observations six months after OVF onset at 11 institutions between 2012 and 2015. The inclusion and exclusion criteria, treatment protocols, and MRI protocols of the two cohort studies were identical and are described below.

Fig. 1 Study flow chart

Written informed consent was obtained from all study participants. Each study protocol was approved by the Institutional Review Board of the representative institution.

Patients

Details of the patients are described in our previous studies [4, 19]. The inclusion criteria were age greater than 65 years and diagnosis of fresh OVF. The exclusion criteria were pathological fractures, two or more new fractures, malignant cancer, dementia, and high-energy injuries. In the first cohort study, 485 patients were enrolled. Among them, 15 died, six underwent surgery, 11 were excluded because of other diseases, and 33 were lost to follow-up. Consequently, 420 patients completed the six-month follow-up. Finally, 352 patients (72.6% follow-up rate) with the required data, including examinations and completed questionnaires at both the time of enrollment and the six-month follow-up, were analyzed in the present study. In the second prospective multicenter cohort study, 218 patients with symptomatic OVFs were eligible for participation. Among them, three died, eight were excluded because of other diseases, and 54 were lost to follow-up. As a result, 153 patients (125 females and 28 males) completed the six-month follow-up (70.2% follow-up rate) and were analyzed in the present study.
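The follow-up rates quoted above follow directly from the enrollment and attrition counts. The short Python sketch below reproduces that arithmetic as a consistency check; the function names are illustrative and are not part of the original analysis.

```python
def completed_follow_up(enrolled, *dropouts):
    """Patients remaining after subtracting every dropout category."""
    return enrolled - sum(dropouts)


def follow_up_rate(analyzed, enrolled):
    """Percentage of enrolled patients who were finally analyzed."""
    return round(100.0 * analyzed / enrolled, 1)


# First cohort: 15 died, 6 underwent surgery, 11 excluded, 33 lost.
first_completed = completed_follow_up(485, 15, 6, 11, 33)
first_rate = follow_up_rate(352, 485)

# Second cohort: 3 died, 8 excluded, 54 lost.
second_completed = completed_follow_up(218, 3, 8, 54)
second_rate = follow_up_rate(second_completed, 218)

print(first_completed, first_rate)    # 420 72.6
print(second_completed, second_rate)  # 153 70.2
```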

At the time of enrollment, the severity of pain was subjectively assessed using the Visual Analogue Scale (VAS), which identified the average level of back pain experienced by patients over the week preceding the study.

Imaging assessment

At the time of enrollment and the six-month follow-up, patients were examined using plain radiography. The kyphotic angle of the fractured vertebrae was determined using the segmental angle in weight-bearing positions. The relative height of the anterior wall (%) of the affected vertebra was calculated as follows: [2 × affected vertebral height/(lower vertebral height + upper vertebral height)] × 100 [4]. Bone mineral density was measured at the toe and hip in the first and second cohorts, respectively.
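The relative anterior wall height formula can be written as a small function. The following is a minimal Python sketch; the function name, the argument names, and the example measurements are hypothetical and serve only to illustrate the calculation.

```python
def relative_anterior_height(affected, upper, lower):
    """Relative anterior wall height (%) of the affected vertebra.

    Implements [2 x affected / (lower + upper)] x 100, where all three
    heights are anterior wall heights in the same unit.
    """
    if upper + lower <= 0:
        raise ValueError("adjacent vertebral heights must be positive")
    return 2.0 * affected / (lower + upper) * 100.0


# Hypothetical measurements in millimetres (not taken from the study data).
print(round(relative_anterior_height(18.0, 24.0, 26.0), 1))  # 72.0
```

A value of 100% means the affected vertebra is as tall as the mean of its two neighbors; lower values indicate greater compression.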

MRI

The patients were examined using MRI at the time of enrollment and the six-month follow-up. The signal change patterns within the fractured vertebral bodies identified via MRI were classified based on midsagittal and bilateral parasagittal (medial aspect of pedicles) T1-weighted images (T1WI) and T2-weighted images (T2WI). Signal change patterns on T1WI were classified into three categories (diffuse low, confined low, and no signal change), whereas those on T2WI were classified into four categories: high (Fig. 2A), confined low, diffuse low (Fig. 2B), and no signal change. The intensity of the high signal changes on T2WI was defined to be similar to that of the cerebrospinal fluid. If at least two of the three slices exhibited a low signal change over more than 50% of the area of the vertebral body, the observation was considered to be a diffuse low signal change. With respect to intra- and inter-observer reliability, the kappa values of the MRI observations were in the range of 0.712–0.942 [4].

Fig. 2 MRI observations in the acute phase. (A) High-signal change on a T2-weighted image observed in the fractured vertebral body. (B) Diffuse low-signal change on a T2-weighted image observed in the fractured vertebral body

Nonunion definition

Plain radiographs were assessed based on comparisons of lateral profiles between supine and weight-bearing positions because an intravertebral cleft can be easily identified in the supine position. Nonunion was defined as a recognizable intravertebral cleft and apparent segmental motion on dynamic plain radiographs (difference greater than 5° between supine and weight-bearing positions) observed at the six-month follow-up.
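The nonunion definition combines two criteria, and it can be expressed as a simple rule. The sketch below is illustrative only: the function and argument names are hypothetical, and it assumes the segmental motion is the absolute difference between the segmental angles measured in the two positions.

```python
def is_nonunion(cleft_visible, angle_supine_deg, angle_weight_bearing_deg,
                threshold_deg=5.0):
    """Apply the two-part nonunion definition at the six-month follow-up.

    Nonunion requires both a recognizable intravertebral cleft and a
    difference of more than threshold_deg between the two positions.
    """
    segmental_motion = abs(angle_weight_bearing_deg - angle_supine_deg)
    return bool(cleft_visible) and segmental_motion > threshold_deg


# Hypothetical angles in degrees (not taken from the study data).
print(is_nonunion(True, 8.0, 15.0))   # True: cleft and 7 degrees of motion
print(is_nonunion(True, 10.0, 14.0))  # False: only 4 degrees of motion
```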

Treatment strategy

Treatment with braces was continued for 2–3 months, with soft and hard braces prescribed to 60 and 40% of the patients, respectively. The patients were allowed to maneuver into an erect posture when the brace was applied. Additionally, patients were prescribed anti-osteoporotic and pain-relief medications according to their individual status.

Prediction by the surgeon

Nonunion was predicted based on MRI observations. T2 high or T2 diffuse low-signal change was previously reported as a predictive factor for nonunion [5, 19, 20].
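As a rough illustration of this MRI-based prediction, the rule can be sketched as a lookup on the T2WI signal pattern. This is an assumption about how the rule would be applied, not a description of the surgeons' actual procedure; the function name and the string labels are hypothetical.

```python
# The four T2WI categories used in this study.
T2_PATTERNS = {"high", "confined low", "diffuse low", "no signal change"}

# Patterns previously reported as predictive of nonunion.
NONUNION_T2_PATTERNS = {"high", "diffuse low"}


def mri_rule_predicts_nonunion(t2_pattern):
    """Return True when the T2WI pattern is one reported to predict nonunion."""
    pattern = t2_pattern.strip().lower()
    if pattern not in T2_PATTERNS:
        raise ValueError(f"unknown T2WI pattern: {t2_pattern!r}")
    return pattern in NONUNION_T2_PATTERNS


print(mri_rule_predicts_nonunion("diffuse low"))   # True
print(mri_rule_predicts_nonunion("confined low"))  # False
```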

Machine learning

Four ML models (logistic regression, decision tree, XGBoost, and RF) were used for nonunion prediction following OVF (Fig. 1). The models were implemented in Python (v. 3.7.6); the scikit-learn library (v. 0.22) was used for logistic regression, decision trees, and RF. The specific parameters of the ML models are depicted in Fig. 1. XGBoost was used as the gradient boosting machine; it features split-finding algorithms that handle sparse data with default node directions, address weighted data using merge and prune operations, and efficiently enumerate all possible splits for splitting threshold optimization [17, 21].

In total, 17 explanatory variables, including the variables listed in Table 1, were used in the ML models. Before developing the prediction models, the collected data were divided into two subsets: 70% were allocated to the training dataset and 30% to the test dataset. The cases in the training dataset were used to develop the ML and logistic regression models. The cases in the test dataset were used to validate and compare the performance of the developed models. Each ML method had its own set of hyperparameters, such as the maximum tree depth in XGBoost or the number of trees in RF. Five-fold cross-validation was used to determine the optimal hyperparameters. This cross-validation process was used only for model development; the performance of the final models was evaluated using the test dataset.
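The data partitioning described above can be sketched with the standard library alone. This is a minimal illustration, assuming a plain random (non-stratified) split and a fixed seed; the study's actual seed, split sizes, and implementation are not stated, and the function names are hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices and hold out test_fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


# 352 + 153 = 505 analyzed patients across the two cohorts.
train_idx, test_idx = train_test_split_indices(505)
folds = list(k_fold_indices(train_idx, k=5))
print(len(train_idx), len(test_idx), len(folds))  # 354 151 5
```

Each candidate hyperparameter setting would be fitted on the training part of every fold and scored on the validation part; the held-out test indices are touched only once, for the final evaluation.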

Table 1 Comparison of baseline data between nonunion and union groups

Recursive feature elimination (RFE), a wrapper feature selection method, employs ML models to compute relevance scores for features [22, 23, 24]. It first trains a model with the entire feature set and computes a relevance score for each feature. In the next step, the feature with the lowest relevance score is removed, and the model is retrained to compute the relevance scores of the remaining features. This process is repeated until the desired number of features remains in the feature set.
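The elimination loop can be sketched in plain Python. In this illustration, a small logistic regression fitted by gradient descent stands in for the study's models, and the absolute value of each coefficient is assumed to be the relevance score; both choices, the function names, and the synthetic data are assumptions made only for the example.

```python
import math
import random


def fit_logistic_weights(X, y, epochs=300, learning_rate=0.1):
    """Fit logistic regression by batch gradient descent; return feature weights."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, target in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights


def recursive_feature_elimination(X, y, names, n_keep):
    """Drop the least relevant feature, retrain, and repeat until n_keep remain."""
    active = list(range(len(names)))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        weights = fit_logistic_weights(X_active, y)
        weakest = min(range(len(active)), key=lambda k: abs(weights[k]))
        del active[weakest]
    return [names[j] for j in active]


# Synthetic data: only "signal" determines the label; the other two are noise.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    signal = rng.gauss(0.0, 1.0)
    X.append([rng.gauss(0.0, 1.0), signal, rng.gauss(0.0, 1.0)])
    y.append(1 if signal > 0 else 0)

selected = recursive_feature_elimination(
    X, y, ["noise_a", "signal", "noise_b"], n_keep=1)
print(selected)
```

Because the model is retrained after every removal, the scores of the remaining features are recomputed at each step, which is what distinguishes RFE from a one-shot ranking.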

Performance evaluation

As mentioned previously, the data were randomly divided into two groups, a training group (used to train the model) and a testing group (used to evaluate the model’s generalization ability), with relative proportions of 70 and 30%, respectively. The cases in the testing dataset were used to validate and compare the performances of the developed models. Optimal hyperparameter values were identified using five-fold cross-validation to avoid overfitting during model development. The cross-validation process was used only during model development, and the performance of the final models was evaluated using the testing dataset. To identify the most important features for nonunion prediction, XGBoost and RF were used to rank the variables in terms of importance. SHAP (SHapley Additive exPlanations), a model-agnostic interpretation method, was used to compute feature importance for the RF model [25]. We trained XGBoost using all candidate features and then obtained the feature importance scores (F-score) [26]. The top 10 variables present in all four ML algorithms were considered during the subsequent stages of model development. The area under the receiver operating characteristic curve (AUC) was used as the performance metric for each model; higher AUC values correspond to better predictive performance. The AUC results were considered excellent for AUC values between 0.9 and 1, good for AUC values between 0.8 and 0.9, fair for AUC values between 0.7 and 0.8, poor for AUC values between 0.6 and 0.7, and failed for AUC values between 0.5 and 0.6 [27]. In addition, accuracy was used as a performance quality metric for the ML classifiers.
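The two performance metrics and the AUC grading scale can be sketched as follows. This is an illustrative implementation, not the code used in the study: the function names are hypothetical, the lower bound of each grade is assumed to be inclusive, and values below 0.6 are all labeled "failed" because the scale above is not defined below 0.5.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, predictions):
    """Fraction of cases whose predicted class matches the true class."""
    correct = sum(1 for t, p in zip(labels, predictions) if t == p)
    return correct / len(labels)


def grade_auc(auc):
    """Map an AUC value to the qualitative grade described in the text."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "good"
    if auc >= 0.7:
        return "fair"
    if auc >= 0.6:
        return "poor"
    return "failed"  # assumption: also used for values below 0.5


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(grade_auc(0.75))                                # fair
```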

Data comparison

The chi-square test or Fisher’s exact test was used to compare categorical variables. The t-test was used to compare continuous variables. Statistical significance was taken to correspond to p < 0.05. All p values were two-sided.
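For a categorical variable with two levels, the group comparison reduces to a test on a 2 × 2 table. The sketch below shows a Pearson chi-square test without continuity correction; whether a correction was applied in the study is not stated, and the counts in the example are hypothetical.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns the test statistic and the p value for one degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With one degree of freedom, the upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts: rows are groups, columns are finding present/absent.
statistic, p_value = chi_square_2x2([[10, 20], [20, 10]])
print(round(statistic, 3), p_value < 0.05)  # 6.667 True
```

When expected cell counts are small, Fisher's exact test is the usual alternative, as noted above.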

Results

Comparison between the nonunion and union groups

Table 1 presents a comparison between the nonunion and union groups in terms of demographic, clinical, and radiological data. Patients in the nonunion group were older than those in the union group (78.4 vs. 76.4 years, p = 0.012) and had a higher body weight (52.7 vs. 50.6 kg, p = 0.046). No significant differences in sex, smoking habits, history of oral steroid use, old OVF, or VAS score for low back pain (LBP) at enrollment were observed between the two groups. In the nonunion group, OVFs were more frequent at the thoracolumbar level (88.2% vs. 71.2%, p = 0.002). In terms of MRI observations, diffuse low-signal change on T1-weighted images was more frequently observed in the nonunion group than in the union group (76.3% vs. 63.1%, p = 0.031). On T2-weighted images, diffuse low- and high-signal changes were more frequently observed in the nonunion group than in the union group (31.8% vs. 14.5% and 37.7% vs. 6.7%, respectively, p < 0.001). Posterior wall injury was more frequent in the nonunion group than in the union group (56.5% vs. 29.3%, p < 0.001). The kyphotic angle and anterior height ratio were more severe in the nonunion group than in the union group (13.0° vs. 9.2°, p < 0.001 and 75.8% vs. 84.5%, p < 0.001, respectively). The T-scores of bone mineral density exhibited no apparent differences. Moreover, there was no difference in nonunion incidence between the first and second cohorts.

Performance evaluation of machine learning algorithms

Table 2 presents the results obtained using the logistic regression model for nonunion incidence. Greater age, lower anterior height ratio, and T2WI observations, including confined high and diffuse low-signal changes, were associated with significantly increased adjusted odds ratios (ORs). Figure 3 depicts the results obtained using a decision tree with a depth of 3. Among the 13 patients with T2 high-signal change, VAS ≥ 9 cm, and age ≥ 72 years, 12 exhibited nonunion at the final follow-up consultation.
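The purity of a decision tree leaf such as the one described above is commonly measured with the Gini index. As a worked example, the sketch below computes the index for a leaf with 12 nonunion and 1 union case; the function name is illustrative, and the value is derived here from the counts in the text rather than read from the figure.

```python
def gini_impurity(class_counts):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("a node must contain at least one case")
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Leaf with 13 patients, of whom 12 exhibited nonunion: 1 - (144 + 1) / 169.
print(round(gini_impurity([12, 1]), 3))  # 0.142

# For comparison, a perfectly mixed two-class node has the maximum value.
print(gini_impurity([5, 5]))  # 0.5
```

Values close to 0 indicate a nearly pure leaf, so this leaf separates nonunion cases well.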

Table 2 Results of the multivariate logistic regression model for nonunion

Fig. 3 Decision tree obtained in this study for nonunion prediction. Classification results for nonunion and union are highlighted in blue and orange, respectively. Smaller values of the Gini index correspond to darker colors. BMI: Body mass index; VAS: Visual analog scale

ML-based systems, including RF and XGBoost, used the following variables: age, sex, VAS, smoking habit, MRI T2-weighted image, posterior wall injury, old OVF, kyphotic angle, steroid use, BMI, anterior wall height, and fracture level. Table 3 presents the optimal hyperparameter values and the performances of the ML classifiers. The ranking of features in terms of importance for the RF and XGBoost models is depicted in Figs. 4A and B, respectively. Table 3 and Fig. 5 present a comparison of the test AUC for nonunion prediction using the proposed models. RF and XGBoost exhibited good prediction accuracy, while logistic regression and decision tree exhibited fair results. RF exhibited the highest AUC (0.860, 95% CI: 0.784–0.916), followed by XGBoost (0.845, 95% CI: 0.768–0.905). In both models, the T2WI observation was the most important feature (Fig. 4A and B). Additionally, anterior height ratio, kyphotic angle, BMI, VAS, age, posterior wall injury, fracture level, and smoking habit were ranked as the top risk factors.

Table 3 Optimal values of hyperparameters and comparison of area under the receiver operating characteristic curve among the different models during the prediction of nonunion
Fig. 4 Variable importance plots. (A) Variable importance plot for the RF model. (B) Variable importance plot for the XGBoost model. T2WI: T2 weighted image; BMI: Body mass index; VAS: Visual analog scale

Fig. 5 Areas under the receiver operating characteristic curve for nonunion prediction. AUC: area under the curve

Discussion

This study presents the first attempt to predict nonunion in OVFs based on ML. Evaluation of risk factors for nonunion following OVF onset is essential for efficient treatment, including early surgical intervention [8]. This study compared the predictive abilities of three ML models and a logistic regression model for nonunion following OVF onset. The results demonstrated that RF exhibited the largest AUC and the highest accuracy in nonunion prediction. The XGBoost model also exhibited good AUC and accuracy. In comparison, we previously reported that the prediction accuracy of spine surgeons based on MRI observations was 75.5%–79.1% using this dataset [5]. Therefore, the ML-based models remarkably improved accuracy compared with human prediction.

ML is not limited by parametric or model-based assumptions that may be difficult to determine a priori. Therefore, ML-based approaches have been used to detect patterns in data to automate complex tasks or make predictions, and they offer different advantages [28]. In particular, ensemble learning methods are popular and powerful ML tools for multivariate regression and classification problems that are used to predict nonlinear relationships [29, 30]. In RF, the constituents of the ensemble are tree-structured predictors, each of which is constructed using an injection of randomness [28]. The XGBoost algorithm balances the influence of each tree by adding weights, and it reduces overfitting by using only a random subset of descriptors during the construction of each tree [31]. The main idea behind the ensemble methodology is to aggregate multiple weighted models to obtain a combined model that outperforms every single model in it [32]. Our results demonstrate that these algorithms are effective and efficient ML methods.
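The aggregation idea behind ensemble methods can be illustrated with a weighted average of per-model probabilities. This is a deliberately simplified sketch: RF averages its trees with equal weights and XGBoost sums tree outputs additively, so neither library combines models exactly this way, and the names and numbers below are hypothetical.

```python
def weighted_ensemble_probability(probabilities, weights):
    """Combine per-model nonunion probabilities into one weighted average."""
    if not weights or len(probabilities) != len(weights):
        raise ValueError("one weight is required per model")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(p * w for p, w in zip(probabilities, weights))
    return weighted_sum / total_weight


def ensemble_predicts_nonunion(probabilities, weights, threshold=0.5):
    """Classify as nonunion when the combined probability reaches the threshold."""
    return weighted_ensemble_probability(probabilities, weights) >= threshold


# Three hypothetical base models; the third is trusted twice as much.
combined = weighted_ensemble_probability([0.2, 0.6, 0.9], [1.0, 1.0, 2.0])
print(round(combined, 2))  # 0.65
print(ensemble_predicts_nonunion([0.2, 0.6, 0.9], [1.0, 1.0, 2.0]))  # True
```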

The incidence of nonunion in OVFs was reported to range between 13.5 and 19.6% [5, 12, 19]. Patients with nonunion in OVFs generally present functional deterioration, severe pain, and limited mobility. A systematic review [33] reported that the current evidence for using orthotic devices in patients with OVF was inconsistent and of limited quality; therefore, clinicians should undertake careful consideration before prescribing them in practice. Two randomized controlled trials [34, 35] demonstrated no apparent effect of braces on the treatment of OVFs. Some patients required surgical intervention. Early intervention was reported to lead to better clinical and radiological outcomes of vertebral augmentation [8, 36] because the loss of anterior wall vertebral height progressed steadily from onset throughout the six months of follow-up in patients with nonunion [5]. Therefore, early nonunion prediction is essential to improve treatment strategies for OVFs.

The present study established MRI findings, anterior height ratio, kyphotic angle, BMI, VAS, age, posterior wall injury, fracture level, and smoking habit as risk factors in the RF and XGBoost models with high accuracy. Several risk factors for nonunion, e.g., posterior wall injury, instability of fractured vertebrae, and MRI observations depicting high-signal changes on T2WI, diffuse low-signal changes on T2WI, and diffuse low-signal changes on T1WI, have been reported. Our previous studies [5, 19] showed that diffuse low- and high-signal changes on T2WI in the acute phase were correlated with an increased risk of nonunion. In addition, other studies [12, 20] demonstrated an association of delayed union with T1 and T2 diffuse low-signal changes and T2 high-signal changes on MRI in the acute phase. Posterior wall injury and anterior vertebral height are also risk factors for nonunion [5, 12, 19]. Inose et al. [12] demonstrated that the degree of pain is also an important predictor of nonunion. Although instability evaluation may be an important step in nonunion prediction, approximately 30% of patients have been reported to be incapable of undergoing dynamic radiography because of severe pain [5]. In addition, bone mineral density has not been reported as a risk factor for nonunion. Therefore, dynamic radiography results in the acute phase and bone mineral density were not included in this model. The performance of the proposed RF and XGBoost models demonstrated that the T2WI observation was the most important feature, confirming the reports of previous studies [5, 12, 19, 20]. RF demonstrated that the degree of compression and pain VAS score were important features. Additionally, obesity and smoking were identified as important features in the proposed RF and XGBoost models. Smoking and obesity have been reported as significant risk factors for nonunion in patients with long bone fractures [37]. Smoking can lead to local vasoconstriction, and the high affinity of carbon monoxide for hemoglobin can decrease the total amount of oxygen transported to the healing site [37, 38]. The literature on obesity and fracture healing is inconclusive, with some authors reporting no correlation and others reporting an increased nonunion risk [32]. Obesity is associated with several conditions, including diabetes, vascular disease, and fracture load, which suggests that obesity is currently not a reliable model of impaired fracture healing [32]. In the current study, information regarding diabetes and vascular disease was not included in the ML models.

This study has certain limitations. Osteoporosis treatment was not included in the ML-based models. However, there have been no reports revealing the effect of osteoporosis agents on the bone healing process in OVF. Further, owing to the small sample size, the data included in the training set for validation were insufficient. Because of this limitation, we did not search for a better ML model; rather, we trained the considered models using various hyperparameters in a predetermined manner. Moreover, the models are valid only for data collected using the measurement equipment used in this study. Therefore, further research is necessary to generalize the model to higher-dimensional data.

In conclusion, the ML-based system developed in this study based on MRI observations is an effective tool for nonunion prediction following OVFs. The proposed RF and XGBoost models exhibited good performance. Further research is warranted to increase the sample size and apply the proposed ML system to other populations.