Introduction

Osteoporosis may not have clinical manifestations prior to the onset of a fragility fracture, which poses challenges for early prevention and diagnosis [1, 2]. This is of clinical importance because increased bone fragility and osteoporotic fractures may be associated with debilitating consequences such as spinal cord compression [3,4,5], avascular necrosis [6, 7], need for surgical intervention, and increased mortality rates [8,9,10,11]. Pain, disability, and loss of independence are additional noteworthy complications that occur in the setting of osteoporosis and fragility fractures. Women who are undergoing or have recently undergone menopause are at particularly elevated risk of developing osteoporosis [12]. However, several other factors have been implicated in the development of osteoporosis, and the nuanced interactions of these factors, along with their effect on the magnitude of osteoporosis risk, have not been well studied [13, 14].

Previous work has demonstrated that differences in serum concentrations of hormones such as estradiol [15, 16], dehydroepiandrosterone sulfate (DHEAS) [17,18,19], and thyroid-stimulating hormone (TSH) [20,21,22] are associated with important differences in risk of osteoporosis or osteoporotic fractures. Recent reports in the literature, however, have suggested that menopause and its associated changes in steroid hormone levels may not make as significant a contribution to the pathophysiology of the disease as was previously thought [23,24,25]. In fact, periodically updated guidelines from the International Society of Clinical Densitometry have included an increasing number of indications for bone density screening in recent years [26]. This suggests that the etiology of the disease is multifactorial, as risk may depend on individual patient phenotypes, and that the complete array of factors contributing to its pathogenesis is not fully known [27,28,29,30].

Existing prediction models pertaining to this patient demographic, such as the Fracture Risk Algorithm (FRAX) tool for predicting osteoporosis-related fracture risk in postmenopausal women and men aged 50 years and older [1], may serve as important clinical adjuncts in patients with osteoporosis. However, FRAX is not intended for premenopausal women or women younger than 50, two additional patient populations at risk of developing osteoporosis [31, 32]. In addition, the goal of FRAX is to predict fractures; however, a diagnosis of osteoporosis based on low BMD can guide clinical management even in the absence of a fracture. Moreover, patient awareness of their diagnosis likely plays a role in fracture risk and other clinical outcomes; however, there has been little investigation into factors influencing patient awareness. While several studies have demonstrated that machine learning models can outperform conventional clinical tools at predicting osteoporosis in postmenopausal women [33,34,35,36], the included populations have been limited to respondents of the Korea National Health and Nutrition Examination Surveys and a single medical center in northern Taiwan, thereby reducing the generalizability of their findings. Although fracture risk is an important metric when considering osteoporosis management, the likelihood of receiving a clinical diagnosis is a clinically relevant outcome that is also used to guide treatment decisions.

There is limited research on the prediction of long-term risk of osteoporosis in women younger than 50, premenopausal women, and perimenopausal women, especially in the USA. While several known risk factors for osteoporosis exist in clinical practice, a combined predictive model incorporating these factors to provide an individualized assessment of long-term osteoporosis risk does not exist in a Western population. The extent to which these risk factors influence patient awareness of a diagnosis is also unclear. Treatment is generally recommended in postmenopausal women who have a bone density T-score of −2.5 or less, a history of spine or hip fracture, or a FRAX score indicating increased fracture risk [37]. Therefore, a clinical model capable of predicting those outcomes at an earlier stage in the course of the disease may be of clinical benefit.

Given this important gap in the literature and its clinical implications, the purpose of the current study was to perform a predictive analysis integrating a comprehensive profile of risk factors in order to better understand osteoporosis risk. The authors hypothesized that (1) the integrated prediction model would demonstrate good to excellent risk discrimination and (2) long-term risk of osteoporosis in pre- and perimenopausal women would be predicted by clinical risk factors, serum biomarker levels, and bone mineral density (BMD) measurements.

Methods

Study design

This is a secondary analysis of 3302 patients using baseline and 10-year follow-up data from the longitudinal Study of Women’s Health Across the Nation (SWAN) [38]. Between 1996 and 1997, participants joined SWAN through seven designated research centers throughout the USA: Ann Arbor, MI; Boston, MA; Chicago, IL; Alameda and Contra Costa County, CA; Los Angeles, CA; Jersey City, NJ; and Pittsburgh, PA. At the time of enrollment, women were pre- or perimenopausal with an intact uterus and at least one ovary, not taking hormones, and between 42 and 52 years of age. Premenopausal women were defined as those having no change in bleeding patterns; perimenopausal women were defined as those with a change in length of the bleeding or inter-bleeding interval with at least one occurrence of bleeding in the past 3 months. At each visit, data were recorded via physical measures, fasting morning blood draw, and both interview- and self-administered questionnaires.

BMD of the posterior-anterior lumbar spine and total hip was measured by dual x-ray absorptiometry (DXA) using either a Hologic QDR 2000 densitometer (Hologic Inc., Waltham, MA) or a Hologic QDR 4500A [39]. All centers employed a standard quality control program that included daily measurement of a Hologic anthropomorphic spine phantom at each site, cross-site calibration with a single anthropomorphic spine phantom, visual review of every scan image by a local site investigator experienced in bone densitometry, and random review of 5% of scans plus all problem scans by Synarc, Inc. (Waltham, MA). Measurements of the local spine phantoms and the circulating spine phantom were analyzed by Synarc and used to adjust DXA measurements for minor temporal or geographic variations. Using standard reference values for the total hip, these measurements were then converted to T-scores [40]. Determination of the presence of an osteoporosis diagnosis was based on the participants’ answers to the question, “Has a doctor, nurse practitioner or other health care provider told you that you had osteoporosis (brittle or thinning bones) or treated you for osteoporosis?”

Each longitudinal follow-up wave included participants from the initial wave. For this analysis, a total of 1617 participants had incomplete baseline data or did not have 10-year follow-up and were excluded. Demographic data and values of risk factors for the remaining 1685 participants are shown in Table 1. Additionally, these data are reported for the 1617 participants who were excluded from the study and compared with those of the participants who were included (Online Table 1 in the Supplementary Information).

Table 1 Values of each of the demographic characteristics and risk factors included in the model for those not diagnosed and those diagnosed with osteoporosis at 10-year follow-up. Data are expressed as mean ± SD unless otherwise noted. P-values were calculated via Student’s t-test, Pearson’s chi-squared test, or Fisher’s exact test with α set to 0.05

Statistical analysis

Of the 1685 women who met the inclusion criteria, 337 (20%) were randomly selected as a holdout set for model evaluation, while the other 1348 (80%) were used to train the predictive model. A generalized additive model with pairwise interactions (GA2M) was implemented in Python using the Explainable Boosting Machine, a visually interpretable machine learning model from the open-source InterpretML package [41, 42]. The primary outcome of interest was whether participants reported they had been told they had a diagnosis of osteoporosis by a doctor or other healthcare provider at 10-year follow-up. For comparison, we also ran a logistic regression.
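
As an illustration only, the 80/20 random partition described above can be sketched with the Python standard library. The function name `split_holdout`, the seed, and the use of integer row indices are assumptions made for this sketch; the exact randomization procedure used in the study is not specified.

```python
import random


def split_holdout(ids, holdout_fraction=0.20, seed=0):
    """Randomly partition ids into (train, holdout) lists.

    Hypothetical helper: the seed and shuffling method are assumptions.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# 1685 included participants, identified here by row index only.
train_ids, holdout_ids = split_holdout(range(1685))
print(len(train_ids), len(holdout_ids))  # 1348 337
```

Fixing the seed makes the partition reproducible, which matters because every reported performance metric depends on which 337 women fall in the holdout set.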

Based on previous literature, we included 14 risk factors available in the SWAN dataset at baseline: age, height, weight, body mass index (BMI), waist circumference, race, menopausal status, maternal osteoporosis history, maternal spine fracture history, serum estradiol level, serum DHEAS level, serum TSH level, total spine BMD, and total hip BMD. To maximize validity, the GA2M also considered six pairwise interaction effects among the 14 risk factors [41]. Summary statistics were reported for all variables and compared via Student’s t-test, Pearson’s chi-squared test, or Fisher’s exact test with α set to 0.05 (Table 1).

Importance scores for each risk factor in the Explainable Boosting Machine were calculated as the “logit,” or log-odds ratio (OR), averaged across all observations in the training dataset (Table 2) [41, 43]. Therefore, a greater importance score corresponds to a larger contribution in predicting risk of diagnosis. Age- and BMI-adjusted odds ratios calculated using logistic regression are also reported in Table 2. Performance of the model was assessed using area under the receiver operating characteristic curve (AUC), Brier score, sensitivity, specificity, accuracy, and kappa. Univariate and pairwise partial dependence plots were created to visualize how, if at all, each risk factor (and pairwise interaction of factors) was related to osteoporosis risk. AUC values between variations of the model were compared via paired t-test.
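
For readers less familiar with the two probabilistic metrics named above, the following minimal sketch computes AUC and Brier score from first principles. The function names and the toy labels and probabilities are illustrative assumptions, not study data, and the pairwise AUC formulation is one standard definition rather than necessarily the routine used by the authors.

```python
def auc(y_true, y_prob):
    """AUC as the chance that a random positive is scored above a random
    negative, with ties counted as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy example (not study data): 1 = diagnosed at follow-up.
y_true = [0, 0, 1, 0, 1, 0]
y_prob = [0.02, 0.10, 0.30, 0.05, 0.08, 0.01]
print(auc(y_true, y_prob))          # 0.875
print(round(brier_score(y_true, y_prob), 4))
```

AUC measures ranking only, whereas the Brier score also penalizes poorly calibrated probabilities, which is why the two are reported together.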

Table 2 Importance scores and age- and BMI-adjusted odds ratios of each predictor as computed by GA2M and logistic regression, respectively

Two discrimination thresholds for predicting a diagnosis (one at which Youden’s J index was maximized and one at which kappa was maximized) were calculated using an exhaustive search and then assessed. Specifically, a sequential search algorithm was implemented across thresholds between 0 and 1 to identify the threshold value at which each of these predictive performance metrics reached its global maximum. These thresholds were used to separate women in the evaluation dataset into three categories (low, medium, and high) based on predicted risk. Confidence intervals (CIs) for AUC and Brier score were calculated using bootstrapping by resampling the evaluation dataset 200 times with replacement. A calibration curve [44] was also plotted using Python.
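
The threshold search can be sketched as a simple grid scan. This is a minimal illustration under stated assumptions: the grid resolution (`steps=1000`), the helper names, the convention that a case is called positive when predicted risk exceeds the threshold, and the toy data are all hypothetical choices for this sketch.

```python
def confusion(y_true, y_prob, threshold):
    """Return (tp, fp, fn, tn); positive means predicted risk > threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        if p > threshold:
            tp, fp = tp + (y == 1), fp + (y == 0)
        else:
            fn, tn = fn + (y == 1), tn + (y == 0)
    return tp, fp, fn, tn


def youden_j(tp, fp, fn, tn):
    """Sensitivity + specificity - 1 (requires both classes to be present)."""
    return tp / (tp + fn) + tn / (tn + fp) - 1


def cohen_kappa(tp, fp, fn, tn):
    """Agreement between predicted and observed class beyond chance."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return 0.0 if pe == 1 else (po - pe) / (1 - pe)


def best_threshold(y_true, y_prob, metric, steps=1000):
    """Scan an evenly spaced grid on [0, 1]; return the first maximizer."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: metric(*confusion(y_true, y_prob, t)))


# Toy example (not study data).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.01, 0.03, 0.05, 0.07, 0.08, 0.15, 0.20, 0.40]
t_j = best_threshold(y_true, y_prob, youden_j)
t_k = best_threshold(y_true, y_prob, cohen_kappa)
print(t_j, t_k)
```

Youden’s J weights sensitivity and specificity equally, while kappa accounts for class imbalance, so with a rare outcome the two criteria can select different thresholds, as they did in this study.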

Results

Risk factor analysis

At 10-year follow-up, 113 (6.7%) of the 1685 women reported that they had been diagnosed with osteoporosis. Compared to those who were not diagnosed, women who reported diagnosis were, on average, older by 1.1 years (P < 0.0001), shorter in height by 1.8 cm (P = 0.0045), weighed 7.6 kg less (P < 0.0001), had a lower BMI by 2.2 kg/m2 (P = 0.0006), were more likely to be perimenopausal (P = 0.0113), and were more likely to have a mother who was diagnosed with osteoporosis (P = 0.0146). In addition, those who were diagnosed had lower total spine BMD measurements by 0.13 g/cm2 (P < 0.0001) and lower total hip BMD measurements by 0.11 g/cm2 (P < 0.0001) at baseline (Table 1). Importance scores for each of the 14 risk factors and six pairwise interaction effects, as well as age- and BMI-adjusted odds ratios calculated using logistic regression, are shown in Table 2.

For each of the six pairwise interactions that were selected by the GA2M, a partial dependence plot was created to visualize how the given combination of variables was incorporated into the model as a predictor of 10-year osteoporosis risk (Online Figure 1 in the Supplementary Information). Each plot represents the change in predicted risk as a function of a pairwise combination of risk factors. Total spine BMD was a member of five of the six pairs; total hip BMD and age were both members of two of the six pairs.

Osteoporosis risk prediction and stratification

On the 337 observations that were initially held out for evaluation of model performance, the GA2M predicted 10-year risk of osteoporosis with an AUC of 0.83 (95% CI, 0.73–0.91) (Fig. 1). Brier score for the model was 0.054 (95% CI, 0.035–0.074).
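
A bootstrap confidence interval of the kind reported here can be sketched as follows. The percentile method, the seed, the helper names, and the synthetic data are assumptions for illustration; the source states only that the evaluation dataset was resampled 200 times with replacement.

```python
import random


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def bootstrap_ci(y_true, y_prob, metric, n_resamples=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed method) for metric(y_true, y_prob)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Draw n row indices with replacement, then recompute the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_prob[i] for i in idx]))
    stats.sort()
    lo = stats[round(alpha / 2 * (n_resamples - 1))]
    hi = stats[round((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


# Synthetic stand-in for a 337-row evaluation set (not study data).
demo_rng = random.Random(1)
y_prob = [demo_rng.random() * 0.3 for _ in range(337)]
y_true = [1 if demo_rng.random() < p else 0 for p in y_prob]
low, high = bootstrap_ci(y_true, y_prob, brier_score)
print(f"Brier score 95% CI: {low:.3f} to {high:.3f}")
```

The same function accepts an AUC metric, although a resample that happens to contain only one outcome class would leave AUC undefined and would need to be redrawn.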

Fig. 1 Receiver operating characteristic curve

Using the same predictors and outcome measure, a logistic regression model had an AUC of 0.62 (95% CI, 0.49–0.76) and Brier score of 0.057 (95% CI, 0.039–0.079). This AUC was significantly lower than that of the GA2M (P = 0.0115). Furthermore, to assess the degree to which baseline BMD and age alone contributed to predictive performance, two additional models were developed and tested. In the first, BMD at both the hip and the spine, as well as age, were excluded from the predictors. This model had an AUC of 0.70 (95% CI, 0.60–0.80) and a Brier score of 0.057 (95% CI, 0.038–0.079). The AUC was not significantly lower than that of the full GA2M (P = 0.082). In the second, hip and spine BMD and age were the only predictors included. This model had an AUC of 0.82 (95% CI, 0.71–0.90) and a Brier score of 0.054 (95% CI, 0.036–0.074). The AUC was not significantly lower than that of the full GA2M (P = 0.822).

Two discrimination thresholds for predicted probability of diagnosis (pdx) were calculated: the threshold at which Youden’s J index was maximized, which was determined to be pdx = 0.063, and the threshold at which kappa was maximized, which was determined to be pdx = 0.140. Using pdx = 0.063, sensitivity was 0.81, specificity 0.82, accuracy 82%, and kappa 0.28. Using pdx = 0.140, sensitivity was 0.43, specificity 0.94, accuracy 91%, and kappa 0.31.

These two thresholds established three risk-stratified categories of women according to predicted 10-year risk of osteoporosis (Fig. 2). In the lowest-risk category were those with pdx ≤ 0.063 (n = 262), 4 of whom (1.5%) were diagnosed with osteoporosis at 10-year follow-up (likelihood ratio [LR] of 0.23). In the medium-risk category were those with 0.063 < pdx ≤ 0.140 (n = 46), 8 of whom (17.4%) were ultimately diagnosed (LR of 3.2). In the high-risk category were those with pdx > 0.140 (n = 29), 9 of whom (31.0%) were ultimately diagnosed (LR of 6.8). Across the three risk categories, observed rates of osteoporosis diagnosis were determined to be significantly different via Pearson’s chi-squared test (P < 0.0001).
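
The stratum counts reported above are sufficient to recompute the likelihood ratios and the threshold-level metrics. The sketch below assumes that each LR is the stratum-specific likelihood ratio (share of diagnosed women in the stratum divided by share of undiagnosed women in the stratum); that definition is an inference, but it reproduces the reported values.

```python
# (women in stratum, women diagnosed) for the 337-woman evaluation set.
strata = {"low": (262, 4), "medium": (46, 8), "high": (29, 9)}
total_dx = sum(d for _, d in strata.values())         # 21 diagnosed
total_no_dx = sum(n - d for n, d in strata.values())  # 316 not diagnosed


def likelihood_ratio(n, d):
    """Stratum-specific LR (assumed definition)."""
    return (d / total_dx) / ((n - d) / total_no_dx)


def metrics_at(positive_strata):
    """Metrics when the listed strata are treated as test-positive."""
    tp = sum(strata[s][1] for s in positive_strata)
    fp = sum(strata[s][0] - strata[s][1] for s in positive_strata)
    fn, tn = total_dx - tp, total_no_dx - fp
    n = tp + fp + fn + tn
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "accuracy": po, "kappa": (po - pe) / (1 - pe)}


for name, (n, d) in strata.items():
    print(name, round(likelihood_ratio(n, d), 2))  # 0.23, 3.17, 6.77

at_063 = metrics_at(["medium", "high"])  # threshold pdx = 0.063
at_140 = metrics_at(["high"])            # threshold pdx = 0.140
print({k: round(v, 2) for k, v in at_063.items()})
print({k: round(v, 2) for k, v in at_140.items()})
```

Rounded to two decimal places, the results match the reported sensitivity, specificity, accuracy, and kappa at both thresholds (0.81, 0.82, 0.82, 0.28 and 0.43, 0.94, 0.91, 0.31), which confirms that the strata and the threshold metrics are internally consistent.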

Fig. 2 Histogram showing the distribution of 10-year osteoporosis risk predictions made by the model at baseline, categorized by follow-up outcome. Frequencies are represented using a logarithmic scale on the y-axis. Dotted lines indicate stratifications of low-, medium-, and high-risk individuals

Calibration of prediction model

A calibration curve was plotted to visualize the relationship between predicted risk of diagnosis at baseline and actual diagnosis rates at follow-up in each of the ten risk-stratified deciles of the evaluation dataset (Fig. 3). R2 for the curve was 0.81 based on a least-squares regression, which yielded a line of best fit with a slope of 1.32 and y-intercept of −0.02. In four of the ten deciles, the GA2M overestimated risk of osteoporosis. In the other six, the GA2M underestimated risk. Importantly, of the ten deciles, only one—the ninth—had a mean pdx that was more than one standard error away from the observed rate.
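
The decile-based calibration analysis can be sketched in two steps: group the evaluation set into ten equal-sized bins by predicted risk, then fit a least-squares line through the (mean predicted, observed rate) points. The function names and the binning rule for group sizes that do not divide evenly are assumptions for this sketch.

```python
def calibration_points(y_true, y_prob, n_bins=10):
    """Sort by predicted risk, cut into n_bins near-equal groups, and return
    (mean predicted risk, observed event rate) for each group.
    Requires at least n_bins observations."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    points = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        points.append((mean_pred, obs_rate))
    return points


def least_squares(points):
    """Return (slope, intercept, R squared) of the ordinary least-squares line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)


# Points lying exactly on y = 2x give slope 2, intercept 0, R squared 1.
slope, intercept, r2 = least_squares([(0.1, 0.2), (0.2, 0.4), (0.3, 0.6)])
print(round(slope, 6), round(intercept, 6), round(r2, 6))
```

A perfectly calibrated model would give a slope of 1 and an intercept of 0; a slope above 1, as reported here, indicates that observed rates rise faster than predicted risk across deciles.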

Fig. 3 Calibration curve. Each plotted point represents 10 percent of the evaluation dataset (n = 337)

A decision curve is shown in Online Figure 2 in the Supplementary Information.

Discussion

The main finding of the current study was that a novel clinical prediction model integrating clinical risk factors, serum biomarker levels, and BMD measurements demonstrated good performance across several metrics for predicting the 10-year likelihood of having been told of a diagnosis of osteoporosis. Of the 14 risk factors included in this analysis, total spine BMD and total hip BMD were the two features with the highest predictive power in estimating 10-year risk of osteoporosis. However, several other factors were found to be of predictive importance, including those pertaining to demographics, physical measurements, and metabolic profiles. Finally, we identified two discrimination thresholds that reliably classified patients into low-, medium-, and high-risk categories for osteoporosis development. With further validation of this model, it may be possible for clinicians to diagnose osteoporosis earlier in the course of the disease and potentially reduce adverse events associated with osteoporosis by identifying those at greatest long-term risk and subsequently implementing prophylactic strategies.

As independent risk factors, total spine BMD (importance score = 0.197) and total hip BMD (importance score = 0.162) were the two top-ranked factors in predicting 10-year risk of osteoporosis (Table 2). These importance scores, expressed in logits, had corresponding age- and BMI-adjusted ORs of 0.42 (95% CI, 0.35–0.51) per 0.1 g/cm2 increase and 0.43 (95% CI, 0.35–0.53) per 0.1 g/cm2 increase, respectively. This finding is in accordance with previously published analyses of longitudinal data from SWAN, which have demonstrated that during this 10-year observation period, BMD decreased by between 8.5 and 11.2% at both the lumbar spine and femoral neck [45,46,47,48]. The machine learning-based analysis we present in this paper extends the existing literature by quantifying the increase in risk of a clinical diagnosis of osteoporosis associated with bone loss in this population. Notably, while we found that spine BMD demonstrated a higher importance score than did hip BMD, only T-score from the femoral neck is included as input to the FRAX model, possibly limiting its predictive ability. Furthermore, while knowledge of a low baseline BMD would hold the potential to influence diagnosis awareness in this study population, this is not a likely source of bias in the current study, as the raw BMD measurements required cross-calibration and quality control for validation, and only 9 (0.5%) of the included participants had baseline T-scores in the osteoporotic range. This distinction is important to consider, as low baseline BMD may be a significant independent risk factor for osteoporosis and/or fractures irrespective of the rate of bone loss.

After BMD, the third- and fourth-ranked risk factors in our model were age (importance score = 0.121) and menopausal status (importance score = 0.095), corresponding to ORs of 1.32 and 1.25, respectively (Table 2). This finding corroborates previously published work suggesting that osteoporosis is associated with aging independently of its association with menopause [11, 23,24,25,26, 30, 49]. Additionally, age exhibited pairwise interaction effects with both spine BMD and hip BMD: therefore, in women with BMD measurements indicative of inferior bone quality, 10-year risk of osteoporosis was higher at a younger age and earlier menopausal status (Online Figures 1D and 1F in the Supplementary Information). Notably, while a model with BMD and age alone does not perform significantly worse than the model with all 14 predictors (BMD and age included), a separate model composed exclusively of the other 11 factors demonstrated good predictive performance. This result suggests that some combination of these 11 predictors might be clinically meaningful in predicting long-term diagnosis, and their marginal predictive value when incorporated into a model based on BMD and age alone would likely become more apparent in a larger cohort. However, further work is needed to evaluate the clinical importance of their contribution to risk assessment. Even in the absence of such work, these findings are nevertheless pertinent for clinical consideration, given the lack of indications for routine BMD screening in this age group. In addition, compared to DXA scans, it is less expensive and less invasive to obtain the remainder of these patient factors, as they require only taking a complete history, performing a physical examination, and collecting bloodwork.

This modeling technique highlighted other pairwise interaction effects as well. The interaction between spine BMD and weight had the highest importance score among all pairwise interactions at 0.052, corresponding to an OR of 1.13 (Online Figure 1A in the Supplementary Information). Overall, the combination of low spine BMD and low bodyweight at baseline conferred a particularly elevated 10-year risk of osteoporosis. However, this relationship was nonlinear with respect to bodyweight: for those at the lower extreme of spine BMD, a weight at the upper extreme was associated with a greater 10-year risk than an intermediate weight. This suggests that those at the extremes of bodyweight (i.e., both underweight and overweight) may be at elevated risk of osteoporosis. These findings might be explained by the effects of the increased release of adipokines in patients with obesity on bone quality [50, 51] or lifestyle factors (e.g., diet and exercise) that concomitantly contribute to the pathogenesis of osteoporosis [52]. Given that BMD is both one of the inputs and a key component of the outcome variable, there does exist some degree of inherent circularity within the model; however, we believe that this model still yields several important insights into osteoporosis risk.

While the current standard of care is to perform DXA screening in women aged 65 or older without risk factors for osteoporosis, the findings in the current study suggest several patient risk profiles in which it may be beneficial to initiate screening at an earlier age. Of the 113 women in this dataset who were ultimately diagnosed with osteoporosis, the mean age at baseline, when the DXA scans were captured, was 46.9 years. Even at that age, which is 18 years younger than guidelines call for screening to take place, the women exhibited significantly decreased BMD in both the spine (0.95 versus 1.08 g/cm2, P < 0.0001) and the hip (0.85 versus 0.96 g/cm2, P < 0.0001) compared to those who were not ultimately diagnosed. While some of those women had risk factors that would otherwise justify early screening, for those who did not, adherence to the current guidelines would have precluded a physician from taking advantage of data that could be used to diagnose osteoporosis at an earlier age. By the age of 65, regardless of symptoms and risk factors, it may be too late to intervene effectively.

Using the two discrimination thresholds calculated to maximize Youden’s J index and kappa (pdx = 0.063 and pdx = 0.140, respectively), this analysis established a systematic approach to classify women as low risk, medium risk, or high risk based on estimated probability of osteoporosis diagnosis within 10 years (Fig. 2). On the evaluation dataset, 1.5% of the low-risk, 17.4% of the medium-risk, and 31.0% of the high-risk women were ultimately diagnosed at follow-up. While stratification into the high-risk category yielded a sensitivity of only 0.43, it was highly specific, yielding a specificity of 0.94. Even at the more aggressive pdx of 0.063, where sensitivity increased to 0.81, specificity only fell to 0.82. This suggests that women in the high- and medium-risk groups might be candidates for prophylactic strategies against osteoporosis. These strategies might include conservative management, such as educational programs or protective equipment; medical management, such as early administration of bisphosphonates or monoclonal antibodies; or even novel surgical techniques.

We believe the relatively small size of the high-risk group is acceptable, because in a clinical setting this highest-risk designation should be adequately specific in order to minimize unwarranted treatment of women in whom the development of osteoporosis is not certain or almost certain, as treatment has its own downsides. This group may be relatively smaller due to the relative paucity of women with the extensive combinations of risk factors seen in the high-risk group. Given the numerous unmeasured and immeasurable factors that influence risk, it may be difficult for this model to identify members of this high-risk group with high specificity, especially given the limited amount of available training data.

At an even finer level of granularity, a calibration curve divided the evaluation data into ten deciles based on predicted osteoporosis risk (Fig. 3). This curve demonstrated a strong correlation between predicted risk and observed risk, with R2 = 0.81, an indicator of robust model validity. In the future, a regression model might be used to predict follow-up BMD as a continuous outcome, rather than the classification model presented here for predicting a binary outcome.

This analysis has limitations. First, for the purpose of this study, time to diagnosis was not considered in this model; however, this is an important factor in predicting fracture risk. Further work is required to validate this model on external datasets, as discussed above. However, the trained model has excellent internal validity, demonstrated by its performance on an evaluation dataset to which it was naïve during training. This is indicated by several performance metrics including AUC, Brier score, sensitivity, specificity, accuracy, kappa, and calibration curve analysis. Second, the model also designates as its outcome variable a self-reported osteoporosis diagnosis, which may fail to capture a population of patients with poor awareness of their condition. Further work is warranted to investigate underlying risk factors for and causes of lack of patient awareness of a diagnosis in the presence of low BMD. Despite its subjectivity, we contend that a strength of this self-reported questionnaire response is its consideration of the multifactorial nature of osteoporosis, which is a complex pathological process. Moreover, given stark differences in biological processes including aging and metabolism, it is reasonable to believe that in populations characterized by different demographic features, it would be necessary to develop a separate model with a different set of risk factors. Third, because of the longitudinal design of SWAN, which lacks a randomization protocol, it is not possible to establish causative relationships between risk factors and predicted 10-year probability of osteoporosis. However, most of the predictors included in this model are unmodifiable and knowing how each relates to risk of osteoporosis would nevertheless be quite influential in motivating a clinical approach. 
Fourth, limiting the analysis to the 1685 women with 10-year follow-up does not allow for the consideration of validity for those without 10-year follow-up, who may exhibit different risk profiles. Fifth, this dataset does not include BMD for the femoral neck in isolation, which is incorporated into the FRAX model and commonly utilized when performing BMD assessments. In the absence of complete data, it is impossible for the model to account for this fact. Finally, the exclusion of approximately half of the original participants based on data availability may have biased results, as those who died might have had lower BMD, given that low BMD is a predictor of mortality, and loss to follow-up is unlikely to be completely random. However, no differences were observed in hip or spine BMD, nor in prevalence of having a baseline T-score below −2.5, between those ultimately included and those excluded (Online Table 1 in the Supplementary Information).

The model developed in this analysis integrates clinical data, serum biomarker levels, and bone mineral densities to predict 10-year risk of osteoporosis with good performance. This statistical modeling technique highlighted the importance of baseline BMD measurements, especially those of the spine, as long-term predictors of osteoporosis risk, and had superior discriminative ability to that of a logistic regression model. However, the current data demonstrate that BMD alone does not entirely determine risk, as numerous other factors including demographics and metabolic profiles contribute to predicted probabilities of diagnosis, both independently and as parts of pairwise interaction effects. Given the discriminative ability of this model, physicians may be able to stratify pre- and perimenopausal women into categories based on predicted osteoporosis risk and determine potential target populations for the investigation of preventative strategies during earlier stages of the disease. Contextualized alongside the clinically important assessment of fracture risk, these risk assessments add novel and meaningful information regarding a distinct clinical endpoint. While further research and validation will be important to assess performance and generalizability outside of this cohort, this clinical prediction model effectively integrates multiple categories of risk factors, highlighting its potential to facilitate earlier identification for more robust prevention and possible interventions in the management of osteoporosis.