Introduction

Osteoporosis is characterized by decreased bone strength and increased fracture risk resulting from low bone mass and deterioration of bone microarchitecture [1]. Sequelae of osteoporosis-related fractures result in significant individual-level health and quality of life-related impairment, and represent a significant system-level financial burden [2].

Low bone mineral density has been associated with an increased risk of osteoporotic fractures; however, algorithms combining clinical risk factors (CRFs) and bone mineral density (BMD) may estimate individual-level fracture risk more accurately, and guide the identification of high-risk individuals who may benefit most from treatment [3]. Prior fractures, falls, advanced age, bone loss, smoking and high alcohol intake represent some of the CRFs that have been associated with increased fracture risk [4, 5]. Individualized estimation of absolute fracture risk to guide decision-making regarding treatment initiation is recommended internationally.

The Fracture Risk Brussels Epidemiological Enquiry (FRISBEE) study is an ongoing population-based cohort study involving 3560 post-menopausal women aged 60 to 85 years from Brussels, Belgium, recruited between July 2007 and June 2013. Three models for fracture risk prediction were recently developed using data from the FRISBEE study: one, for major osteoporotic fracture (MOF) prediction (similar to FRAX); two, for all fracture prediction (similar to the Garvan FRC); and three, for central fracture prediction. Models were developed using a sub-distribution regression method, accounting for competing risk of mortality. In addition to total hip BMD and spine BMD, the MOF prediction model retained three CRFs from FRAX (age, history of fracture, high alcohol intake), and the all fracture prediction model retained all Garvan FRC CRFs (age, total hip BMD, history of fracture, history of recent fall). Central fractures have been associated with higher fracture risk relative to MOFs [6, 7]; the original prediction model that was developed included five CRFs (age, total hip BMD, history of fracture, spine BMD, rheumatoid arthritis). All three nomograms demonstrated good discrimination with AUROCs of 0.72 to 0.73, well-performing calibration curves, and concordance analyses showing moderate to good reliability when compared against FRAX and the Garvan FRC [8].

No prior studies have externally validated the three FRISBEE-based prediction models. The FRISBEE cohort was limited to post-menopausal women and recruited participants exclusively from Brussels; validation in other population-based samples internationally is warranted [8]. To further assess the generalizability of FRISBEE prediction models and their performance in the clinical setting, we examined their predictive performance in a large clinical registry from Manitoba, Canada.

Methods

Study design and population

We conducted a retrospective analysis using the Manitoba Bone Mineral Density (BMD) registry. The database is a well-validated province-wide integrated repository of all dual-energy X-ray absorptiometry-bone mineral density (DXA-BMD) scans since 1997, and facilitates anonymized patient-level record linkage with other population-level computerized healthcare data.

We identified all women aged 60 to 85 years with DXA-BMD scans completed from September 1, 2012 (date when the intake questionnaire included self-reported frequency of falls in the prior 12 months) to March 31, 2018. Women were excluded if they were not registered for health care in Manitoba or had missing baseline measurements required for the FRISBEE prediction models. For those with more than one qualifying examination, only the first was included. The study was approved by the Health Research Ethics Board for the University of Manitoba.

Fracture risk prediction calculation

Five-year probabilities of all fractures, MOFs and central fractures were calculated using the FRISBEE prediction models [8]. Hip and lumbar spine DXA scans were performed and analysed in accordance with manufacturer recommendations (Lunar iDXA, GE HealthCare); total hip and lumbar spine BMD measurements were converted to equivalent Hologic units using previously published formulae for fan-beam DXA systems [9]. Height and weight were recorded at the time of the BMD test.

All included women received an intake questionnaire by mail approximately 2 to 4 weeks prior to their BMD appointment. Self-reported history of falls in the prior year, alcohol use (3 or more per day designated as high alcohol intake), and diagnosis of rheumatoid arthritis were collected. All data were reviewed for completeness and accuracy by the BMD technologist at the time of BMD testing and by the physician at the time of BMD reporting; there were no missing data. History of prior fractures without major trauma and other comorbidities (hyperthyroidism, ankylosing spondylitis, celiac disease, chronic liver disease, inflammatory bowel disease, cerebrovascular disease, multiple sclerosis, muscular dystrophy, chronic pancreatitis, Parkinson disease, aromatase inhibitor use, solid organ transplantation) were obtained from linkage with population-based healthcare data (hospital discharge abstracts and medical claims diagnoses since 1984), as previously described [10,11,12].

Fracture outcome definition

Provincial population-based health records were assessed for the presence of fracture diagnostic codes following the BMD assessment, up to March 31, 2018. Fractures that were not associated with trauma codes were assessed through a combination of hospital discharge abstracts (diagnoses and procedures coded using the International Classification of Diseases, Ninth Revision, Clinical Modification [ICD-9-CM] prior to 2004 and International Classification of Diseases, Tenth Revision, Canadian Enhancements [ICD-10-CA] thereafter) and physician billing claims (coded using ICD-9-CM). We previously validated these data sources for fracture detection compared with x-ray review; diagnostic algorithms were tested and adopted for national osteoporosis surveillance [13,14,15].

For the current analysis, we identified incident non-traumatic fractures, MOFs, and central fractures. Incident fractures were defined as fractures that occurred after the index BMD measurement with site-specific fracture codes, derived from hospitalizations or physician visits. All fractures were defined as any fracture (excluding craniofacial, hand and foot). MOFs were defined as clinical fractures involving the vertebrae, hip, distal forearm, and humerus. Central fractures were defined as fractures occurring proximal to the forearm and ankle, and included clinical fractures of vertebrae, hip, humerus, pelvis, ribs, scapula, clavicles, and sternum [16].
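The three outcome definitions can be read as set membership over fracture sites. The following is a minimal Python sketch under that reading, not the registry's coding algorithm: the site labels and the function name `outcome_groups` are hypothetical, and the underlying diagnostic codes are not modelled.

```python
# Illustrative site labels (hypothetical; the registry works from diagnostic codes).
EXCLUDED_SITES = {"craniofacial", "hand", "foot"}
MOF_SITES = {"vertebra", "hip", "distal forearm", "humerus"}
CENTRAL_SITES = {
    "vertebra", "hip", "humerus", "pelvis",
    "rib", "scapula", "clavicle", "sternum",
}


def outcome_groups(site):
    """Return the outcome definitions that a fracture at `site` counts toward."""
    groups = set()
    if site not in EXCLUDED_SITES:
        groups.add("all")       # any fracture except craniofacial, hand and foot
    if site in MOF_SITES:
        groups.add("MOF")       # major osteoporotic fracture
    if site in CENTRAL_SITES:
        groups.add("central")   # central fracture
    return groups
```

Under these definitions a hip fracture counts toward all three outcomes, whereas a pelvic fracture counts toward all fractures and central fractures but not MOFs.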

To minimize potential misclassification of prior fractures as incident fractures, we conservatively required that there be no hospitalization or physician visits with the same fracture type in the 6 months preceding an incident fracture diagnosis.
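The washout rule can be sketched as follows. This is an illustrative Python outline rather than the validated registry algorithm: the function name `incident_fractures`, the (date, site) tuple layout, and the approximation of 6 months as 183 days are all assumptions made for the example.

```python
from datetime import date, timedelta

WASHOUT = timedelta(days=183)  # assumption: "6 months" approximated as 183 days


def incident_fractures(index_date, encounters):
    """Return the (date, site) encounters counted as incident fractures.

    `encounters` is a list of (date, site) tuples from hospitalizations or
    physician visits. An encounter counts only if it falls after the index
    BMD date and no encounter with the same site occurred in the preceding
    washout window (including encounters before the index date).
    """
    ordered = sorted(encounters)
    incident = []
    for when, site in ordered:
        if when <= index_date:
            continue  # prior fractures are never incident
        recent_same_site = any(
            s == site and when - WASHOUT <= d < when for d, s in ordered
        )
        if not recent_same_site:
            incident.append((when, site))
    return incident
```

In this sketch a hip fracture visit one month after the index date is not counted when a hip fracture encounter was recorded three months earlier, because it is more likely to represent follow-up care for the earlier fracture.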

Statistical analysis

Descriptive statistics for demographic and baseline characteristics are presented as mean ± standard deviation (SD) for continuous variables, and frequency (%) for categorical variables, unless otherwise stated. The primary analysis examined the performance characteristics of the FRISBEE predictive models in stratifying incident fracture, MOF and central fracture risk [8], using area under the receiver operating characteristic curve (AUROC). Cox proportional hazards regression models for incident fractures were used to estimate gradient of risk as hazard ratios (HR) with 95% confidence intervals (CI) for each SD increase in risk (log-transformed due to a skewed distribution) and across tertiles of fracture risk (referent was lowest risk). The proportional hazards assumption was confirmed by testing log-scaled Schoenfeld residuals (all p > 0.2). Ratios of 5-year observed versus predicted all fractures, MOFs and central fractures were used to assess calibration overall (calibration-in-the-large) and for risk tertiles (calibration slope), where unity indicates perfect calibration. Observed risk was estimated from the cumulative incidence function and included the effect of competing mortality [17]. Statistical analyses were performed with Statistica (Version 13.0, StatSoft Inc, Tulsa, OK, USA) and AUROCs were generated using IBM SPSS for Windows (version 27, IBM Corporation).
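To illustrate how competing mortality enters the observed risk, the sketch below implements a standard nonparametric cumulative incidence estimator (Aalen-Johansen form) in plain Python. It is an assumption-laden illustration, not the procedure run in the statistical software: the function name `cumulative_incidence` and the event coding (0, 1, 2) are hypothetical.

```python
def cumulative_incidence(times, events, horizon):
    """Estimate the probability of fracture by `horizon` with death as a competing event.

    times  : follow-up time for each subject (years)
    events : 0 = censored, 1 = fracture, 2 = death without fracture
    """
    at_risk = len(times)
    event_free = 1.0  # probability of neither fracture nor death so far
    cif = 0.0
    for t in sorted(set(times)):
        if t > horizon:
            break
        at_t = [e for tt, e in zip(times, events) if tt == t]
        fractures = at_t.count(1)
        deaths = at_t.count(2)
        # Fracture probability at t is weighted by the chance of still being event-free.
        cif += event_free * fractures / at_risk
        event_free *= 1 - (fractures + deaths) / at_risk
        at_risk -= len(at_t)  # events and censored subjects leave the risk set
    return cif
```

Because a death removes probability mass from `event_free` without adding to `cif`, this estimate is never higher than one that treats deaths as censored observations.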

Results

The Manitoba registry cohort included a total of 10,592 women aged 60 to 85 years with baseline DXA-BMDs conducted between September 1, 2012 and March 31, 2018. Of this cohort, 138 women were excluded as non-Manitoba residents, and 738 were excluded because total hip and lumbar spine BMD data were not available. The final analysis therefore included 9716 women. Of this cohort, 9265 (95.4%) women were alive at final follow-up; 339 (3.5%) had died, and 112 (1.2%) had moved.
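The cohort flow can be checked with simple arithmetic. The short Python sketch below uses only the counts reported in this paragraph; the variable names are illustrative.

```python
screened = 10592
excluded_non_resident = 138
excluded_missing_bmd = 738

# Final analytic cohort after both exclusions.
analysed = screened - excluded_non_resident - excluded_missing_bmd

# Status at final follow-up, expressed as a percentage of the analytic cohort.
status = {"alive": 9265, "died": 339, "moved": 112}
shares = {k: round(100 * v / analysed, 1) for k, v in status.items()}
```

The three status counts sum to the analytic cohort of 9716; the rounded percentages sum to 100.1 because of rounding.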

Mean age was 70.7 (SD 5.3) years. A minority of participating individuals had self-reported previous fractures (21.7%), recent falls (19.3%), high alcohol use (0.3%), rheumatoid arthritis (4.0%) and other comorbidities (14.1%) (Table 1). Mean 5-year risks of all fractures, MOFs and central fractures based on FRISBEE prediction models were 12.5% (SD 7.2), 9.5% (SD 6.9) and 9.2% (SD 7.5), respectively.

Table 1 Study population characteristics

During a mean follow-up period of 2.5 (SD 1.6) years, 377 (3.9%) individuals sustained fractures, 264 (2.7%) sustained MOFs, and 259 (2.7%) sustained central fractures. Pairwise comparisons for individual FRISBEE prediction model components and their association with incident fractures, MOFs and central fractures are presented in Table 2. Lower hip and lumbar spine BMD, older age, and history of previous fractures and previous falls were associated with increased risks of all fractures, MOFs and central fractures (p < 0.001); high alcohol use, rheumatoid arthritis and other comorbidities were not significantly associated (p > 0.05). Mean FRISBEE risk predictions were significantly higher in those with versus without incident fractures, MOFs and central fractures (p < 0.001).

Table 2 Study population characteristics stratified by incident fracture status

AUROCs for 5-year fracture risk stratification for incident fractures, MOFs and central fractures are presented in Table 3. The FRISBEE prediction model for all fractures performed well in stratifying risk for all incident fractures (AUROC 0.69, 95% CI 0.67 to 0.72), MOFs and central fractures (for both: 0.68, 95% CI 0.65 to 0.71). The prediction model for MOFs similarly stratified risk well for all incident fractures (0.72, 95% CI 0.69 to 0.75), MOFs and central fractures (for both: 0.71, 95% CI 0.68 to 0.74). Similar results were obtained with the prediction model for central fractures, both for all incident fractures (0.73, 95% CI 0.70 to 0.76) and for MOFs and central fractures (0.72, 95% CI 0.69 to 0.75).
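For readers less familiar with the metric, the AUROC can be interpreted as the probability that a randomly chosen woman with an incident fracture has a higher predicted risk than a randomly chosen woman without one. The Python sketch below computes it by direct pairwise comparison; it is illustrative only (the function name `auroc` is hypothetical), treats the outcome as binary, and does not reproduce the statistical software used in the study.

```python
def auroc(scores, outcomes):
    """Probability that a random case scores higher than a random non-case.

    scores   : predicted risks
    outcomes : 1 = incident fracture, 0 = no incident fracture
    Ties between a case and a non-case count as one half.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation of women with and without fractures.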

Table 3 Area under the receiver operating characteristic curve (AUROC) for 5-year FRISBEE predictions for all fractures, MOFs and central fractures

HRs per SD increase in FRISBEE risk predictions from Cox proportional hazards models are summarized in Table 4. There was a significant gradient of risk for all fractures (HR 1.98, 95% CI 1.80 to 2.19), MOFs (2.07, 95% CI 1.84 to 2.33) and central fractures (2.26, 95% CI 2.00 to 2.55) (p < 0.001 for all three outcomes). Cumulative fracture incidence plots showed clear separation according to risk tertiles (all log-rank p < 0.001) as shown in Fig. 1. Compared to the lowest risk tertile, middle and highest tertiles were associated with significantly increased risk for all fractures (middle: HR 2.25, 95% CI 1.61 to 3.16; highest: HR 4.70, 95% CI 3.45 to 6.41), MOF (middle: 2.36, 95% CI 1.53 to 3.65; highest 5.78, 95% CI 3.90 to 8.57) and central fractures (middle: 2.41, 95% CI 1.54 to 3.79; highest: 6.50, 95% CI 4.33 to 9.75).
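The two risk scales used in these models can be sketched in Python as follows. The example only prepares the covariates (a log-transformed, standardized risk and a tertile label); the Cox regression itself is not shown. The function names are hypothetical, and the use of the sample SD and of the default cut-point method of `statistics.quantiles` are assumptions that may differ from the study software.

```python
import math
import statistics


def standardized_log_risk(risks):
    """Log-transform predicted risks and scale them to mean 0 and SD 1."""
    logs = [math.log(r) for r in risks]
    mean = statistics.fmean(logs)
    sd = statistics.stdev(logs)  # assumption: sample SD
    return [(x - mean) / sd for x in logs]


def tertile_labels(risks):
    """Label each risk 0 (lowest), 1 (middle) or 2 (highest) tertile."""
    low_cut, high_cut = statistics.quantiles(risks, n=3)
    return [0 if r <= low_cut else 1 if r <= high_cut else 2 for r in risks]
```

With the standardized covariate, the exponentiated Cox coefficient is the HR per SD increase in log-transformed predicted risk; with the tertile labels, the lowest tertile serves as the referent.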

Table 4 Hazard ratios (per standard deviation increase from FRISBEE 5-year predictions) and calibration ratios (5-year observed cumulative incidence versus predicted fracture probability)

Fig. 1 Five-year cumulative fracture incidence with unadjusted hazard ratios (left panels) and 5-year predicted versus observed calibration ratios (right panels) for all fractures, major osteoporotic fracture (MOF) and central fracture risk, stratified by tertile

Calibration (observed versus predicted 5-year fracture risk) with the FRISBEE models is summarized in Table 4 and Fig. 1. There was overestimation in risk of all fractures (calibration-in-the-large 0.63, calibration slope 0.63), MOF (calibration-in-the-large 0.51, calibration slope 0.57) and central fractures (calibration-in-the-large 0.55, calibration slope 0.60).
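The observed-to-predicted ratio can be written out explicitly. The Python sketch below is a minimal illustration under the assumption that the ratio is the observed 5-year cumulative incidence divided by the mean predicted probability; the function name `calibration_ratio` is hypothetical.

```python
def calibration_ratio(observed_incidence, predicted_risks):
    """Observed 5-year cumulative incidence divided by mean predicted risk.

    A ratio of 1 indicates perfect calibration, below 1 indicates that the
    model overestimates risk, and above 1 that it underestimates risk.
    """
    mean_predicted = sum(predicted_risks) / len(predicted_risks)
    return observed_incidence / mean_predicted
```

For instance, an observed incidence of 6% against a mean prediction of 10% (illustrative numbers, not study results) gives a ratio of 0.6, which would indicate overestimation of risk by the model.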

Discussion

We found that the FRISBEE prediction models performed well for risk stratification for all fractures, MOF and central fractures. However, evaluation of calibration showed that the models overestimated risk for all three fracture outcomes. To our knowledge, this is the first external validation study for FRISBEE 5-year prediction models.

Several externally validated tools, including FRAX, the Garvan fracture risk calculator (FRC) and QFracture, are available for use in clinical practice. The tools use different combinations of risk factors and BMD to estimate fracture risk. No robust head-to-head comparisons of the available tools against one another or against the FRISBEE 5-year models exist [18, 19]. The FRISBEE 5-year prediction models include risk associated with history of falls and previous fractures, hip and spine BMD, and a number of other established risk factors (age, high alcohol intake, rheumatoid arthritis, other comorbidities). Several other risk factors (parental hip fracture, glucocorticoid use, current smoking, high or low body mass index, sedentary lifestyle, sleep disturbances, education level, early non-substituted menopause) were not included based on multivariate analyses.

The source of the observed FRISBEE miscalibration in our cohort of women from Canada is unclear, especially since the FRAX tool for Belgium generates similar probabilities to the Canadian FRAX tool. Several possibilities may contribute to the observed differences. First, miscalibration is not uncommon when prediction tools are externally validated in other populations, even when risk stratification is similar. Both overestimation and underestimation of fracture risk with Garvan FRC, and overestimation of hip fracture risk with the Fracture Risk Evaluation Model (FREM), have recently been demonstrated when applied to our cohort [20, 21]. Second, marked variation in fracture risk exists between countries. These differences are poorly understood but are likely multifactorial, reflecting unmeasured lifestyle, environmental and genetic factors [22]. Third, differences between modelled and observed fracture risk may reflect differences in the relative proportion of MOF to hip fractures. While the Canadian FRAX model demonstrated good calibration for MOF prediction [23, 24], the Belgian FRAX model severely underestimated risk in the FRISBEE cohort (calibration slope 2.12, p < 0.001) [25]. This discrepancy may reflect the 1.7- to 1.8-fold higher MOF to hip ratios in the FRISBEE cohort compared to the ratios used to estimate non-hip fractures for calibrating the Belgian FRAX tool [26, 27].

A major strength of our study is that our sample was representative of older women seen in routine clinical practice for assessment of bone health [28]. Our results are therefore broadly generalizable to clinical practice. There are several limitations to consider. First, our analysis was limited to those referred for BMD testing; there may therefore be a selection bias for individuals at intrinsically higher risk of all fractures. However, this would actually produce higher observed/predicted ratios (i.e., risk underestimation, not overestimation) if there are strong risk factors in the referral population not captured by the risk prediction tool. Conversely, elderly individuals living in congregate settings (e.g., nursing homes) and those with severe function-impairing disabilities or higher short-term mortality may be less likely to be referred for such testing [29]. The same limitations apply to the FRISBEE cohort. Second, fracture rates vary between countries and populations, which may limit application to settings with much higher or lower risk than in Canada [22]. Finally, mean follow-up duration was short at 2.5 years. However, this would not affect assessment of calibration which estimates 5-year fracture probability from the cumulative incidence function.

In summary, we found that recently published FRISBEE 5-year prediction models are able to risk stratify all fractures, MOFs and central fractures in Canadian women, but overestimated fracture risk, emphasizing the need for re-calibration before application in the Canadian context. Further efforts are warranted to externally validate the prediction models in other international cohorts, and should ideally include head-to-head comparisons against other available fracture risk assessment tools. Where miscalibration is identified, high-quality data (ideally population-based and representative of the population’s demographics and clinical risk factors) could be used to obtain satisfactory calibration.