Introduction

Osteoporotic fractures confer a substantial burden on individual patients and societies, including an increased mortality, a loss of quality-adjusted life years, and a high economic cost. In Europe, for example, an estimated 56.9 billion euro was spent on osteoporosis in 2019, of which 64% was for the treatment of incident fractures [1,2,3]. In recent years, it has been demonstrated that some individuals experience a particularly high risk of osteoporotic fractures in the short (imminent) term [4,5,6]. Timely identification of such individuals would allow further evaluation and fracture risk management, potentially preventing the devastating fractures. While many tools have been developed to evaluate individual fracture risk [7], among others FRAX® [8], QFracture® [9], and Garvan [10, 11], there is an unmet need for validated tools — automated or manual — for the identification of individuals with an increased fracture risk in the short-term window, also termed “imminent risk.” One example of this is patients with a recent fragility fracture.

The Fracture Risk Evaluation Model (FREM) was developed for automated case-finding of individuals aged 45 years or older who are at high imminent (1-year) risk of major osteoporotic fractures (MOF) and hip fractures (HF) [12]. The tool was developed and subsequently validated using model selection on logistic regression models, with routinely collected individual patient data from the Danish Health Registries as candidate predictors. It consists of odds ratios for a selected number of comorbidities, which are used for absolute risk prediction. Internal validation demonstrated good discrimination (given as the area under the receiver operating characteristic curve, AUC) for MOF and excellent discrimination for HF. Using risk cutoffs of 2% for MOF and 0.3% for HF, positive predictive values were low, while negative predictive values were very high (generally > 99% for 1-year risk prediction) [12, 13].
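As an illustration of how odds ratios from a logistic regression model can be turned into an absolute 1-year risk, consider the following minimal Python sketch. The baseline risk and the odds ratios are hypothetical placeholders chosen for illustration only; they are not the published FREM coefficients, and the function name is our own.

```python
def absolute_risk(baseline_risk, odds_ratios):
    """Combine a baseline risk with the odds ratios of the risk factors present.

    In a logistic model the log-odds are additive, so every risk factor that
    is present multiplies the baseline odds by its odds ratio.
    """
    odds = baseline_risk / (1.0 - baseline_risk)
    for odds_ratio in odds_ratios:
        odds *= odds_ratio
    return odds / (1.0 + odds)


# Hypothetical values, NOT the published FREM coefficients.
baseline = 0.005        # assumed 1-year risk for a person without risk factors
present = [1.8, 1.4]    # assumed odds ratios of two recorded conditions

risk = absolute_risk(baseline, present)
flagged = risk >= 0.02  # 2% MOF cutoff mentioned in the text
```

With these assumed inputs the predicted risk is roughly 1.25%, which stays below the 2% MOF cutoff, so this hypothetical individual would not be flagged.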

To evaluate the generalizability and applicability of novel fracture risk assessment tools, they need to be tested in other populations and compared with widely deployed risk assessment tools. Hence, the aim of this study was to examine the ability of FREM to identify individuals at high imminent risk of fracture among women and men from Manitoba, Canada. Furthermore, we also set out to compare the performance of FREM with that of FRAX (including BMD) and age alone. The Manitoba DXA database and linked healthcare databases represent a unique opportunity to evaluate FREM in a different setting where appropriate clinical risk factor information and fracture outcomes have already been collected and verified.

Materials and methods

Study population

In the Canadian province of Manitoba (population 1.3 million in 2017), health services are provided to virtually all residents through a public healthcare system. DXA-based BMD testing has been managed as an integrated clinical program; criteria for testing have been published [14]. The program maintains a database of all DXA results which can be linked with other provincial population-based computerized health databases through an anonymous personal identifier. The DXA database has completeness and accuracy in excess of 99% [15].

The study population consisted of all women and men age 45 years or older with baseline DXA scans from January 1, 1996, to March 31, 2016. For those with more than one qualifying examination, only the first was included. We excluded those not registered for healthcare in Manitoba and those without 2 years of coverage after the first DXA (index date).

Estimation of fracture risk from FREM and FRAX

One-year predicted incidence of MOF and HF after index date was calculated with FREM using up to 10 years of prior hospital and physician claims data (implemented using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA)) [12]. Over 93% of the included individuals had at least 10 years of coverage and registry data before their baseline assessment. FREM consists of a weighted score including 38 risk factors for MOF for women and 43 risk factors for men, and 32 risk factors for hip fractures for both sexes. Recency of fracture is not reflected in FREM. These risk factors were determined by a data-driven algorithm based on Danish hospital diagnoses registry data with the aim of maximizing predictive ability without requiring plausible causal effects. Each condition considered by FREM was in this study assessed through a combination of hospital discharge abstracts (diagnoses and procedures coded using the International Classification of Diseases, Ninth Revision, Clinical Modification [ICD-9-CM] prior to 2004 and International Classification of Diseases, Tenth Revision, Canadian Enhancements [ICD-10-CA] thereafter) and physician billing claims (coded using ICD-9-CM in all years). Crosswalk tables were generated for ICD-9-CM and ICD-10 codes. Each condition was entered as present if there was a single hospitalization diagnosis or multiple physician claims diagnoses (2 or more within a 3-year time period). A single physician claims diagnosis was considered insufficient for registering a condition. Previous work has shown that this approach has good sensitivity and specificity for the diagnosis of several conditions associated with fractures [16,17,18]. In the evaluation, FREM scores were created based upon hospitalizations (in-patients only) alone (denoted FREM (H) in the following), physician claims alone (FREM (P)), or the combination (FREM (HP)). 
The coefficients in the Danish FREM score calculation were based on hospital in- and outpatient consults as previously described applying an ICD-9-CM/ICD-10 crosswalk to translate the ICD codes included in FREM [12].
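The case definition described above (one hospitalization diagnosis, or two or more physician claims within a 3-year period) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the 3-year period is approximated as 3 × 365 days, the function names are hypothetical, and the original implementation was in SAS, so this is not the study code.

```python
from datetime import date, timedelta

WINDOW = timedelta(days=3 * 365)  # assumption: 3 years approximated in days


def condition_present(hospital_dates, claim_dates):
    """Return True if a condition counts as present for one individual.

    hospital_dates: dates of hospitalizations carrying the diagnosis
    claim_dates: dates of physician claims carrying the diagnosis
    """
    if hospital_dates:
        return True  # a single hospitalization diagnosis is sufficient
    ordered = sorted(claim_dates)
    # If any two claims lie within the window, two consecutive ones do too.
    return any(later - earlier <= WINDOW
               for earlier, later in zip(ordered, ordered[1:]))


def source_variants(hospital_dates, claim_dates):
    """Condition status under the three data-source variants H, P and HP."""
    return {
        "H": condition_present(hospital_dates, []),
        "P": condition_present([], claim_dates),
        "HP": condition_present(hospital_dates, claim_dates),
    }
```

For example, `source_variants([], [date(2010, 1, 1), date(2012, 6, 1)])` marks the condition as present for the P and HP variants but not for H, because there is no hospitalization record.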

Ten-year probabilities of MOF and HF were calculated using the Canadian FRAX tool (FRAX® Desktop Multi-Patient Entry, version 3.8) as recently described [19]. Femoral neck BMD was included in the calculation. Hip DXA scans were performed and analyzed in accordance with manufacturer recommendations. The seven cross-calibrated instruments used for this study (1 DPX, 3 Prodigy and 3 iDXA, GE/Lunar Healthcare, Madison WI) exhibited equivalent phantom and in vivo calibration (T-score differences < 0.1), with stable long-term performance (coefficient of variation < 0.5%). The program’s quality assurance is under strict supervision by a medical physicist [14]. The Canadian FRAX tool was calibrated using nationwide hip fracture and mortality data [20] and independently validated in the general population [21, 22]. In this cohort, 11,025 (14.7%) of the participants had a prior fragility fracture after age 40 included in the FRAX tool.

Moreover, we applied models including age as the only predictor of fracture risk, as it is known that age of the individual is a strong predictor in its own right [23]. These models were included to ascertain to which degree FREM, which includes age as a predictor, outperforms age alone.

Incident fracture ascertainment

Manitoba Health records for the study population were assessed for the presence of fracture diagnosis codes following BMD assessment up to March 31, 2018, using previously validated algorithms [24, 25]. Analyses were based upon hip, clinical vertebral, forearm, and humerus fracture diagnostic codes (collectively designated “major osteoporotic fractures” (MOF)) and hip fracture (HF). To minimize potential misclassification of prior incident fractures, we conservatively required that there be no hospitalization or physician visit(s) with the same fracture type in the 6 months preceding an incident fracture diagnosis. Fractures associated with codes indicating high trauma (external injury) were excluded, as FREM is not tailored to incorporate the uncertainty related to traumatic injuries.
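The ascertainment rules above (fractures after the index date, no record of the same fracture type in the preceding 6 months, high-trauma codes excluded) can be sketched in Python as follows. The record layout, the function name, and the approximation of 6 months as 183 days are assumptions made for illustration; the validated algorithms cited in the text may differ in detail.

```python
from datetime import date, timedelta

WASHOUT = timedelta(days=183)  # assumption: 6 months approximated in days


def incident_fractures(records, index_date):
    """Return (date, fracture_type) pairs counted as incident fractures.

    records: list of (date, fracture_type, high_trauma) tuples for one person
    index_date: date of the baseline DXA scan
    """
    incident = []
    last_seen = {}  # fracture type -> date of the most recent record
    for when, fracture_type, high_trauma in sorted(records):
        previous = last_seen.get(fracture_type)
        last_seen[fracture_type] = when
        if high_trauma or when <= index_date:
            continue  # high-trauma codes and pre-index records are not outcomes
        if previous is not None and when - previous <= WASHOUT:
            continue  # probably follow-up care for an earlier fracture
        incident.append((when, fracture_type))
    return incident
```

In this sketch, a hip fracture record 3 months after an earlier hip fracture record is treated as follow-up care rather than a new event, whereas a hip fracture record more than 6 months after the last one counts as incident.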

Statistical analysis

Descriptive statistics for demographic and baseline characteristics are presented as mean ± standard deviation (SD) for continuous variables or count (proportion) for categorical variables and stratified by sex. Sex differences were tested by t test for numerical variables.

The proportions of individuals experiencing a MOF or a HF during 1 and 2 years after risk ascertainment are presented as counts and frequencies. Stratification in 1- and 2-year MOF and HF prediction was assessed from the area under the receiver operating characteristic curve (AUC) using each version of FREM, FRAX with BMD (as the upper bound reference standard), and age alone (as the lower bound base case). AUCs for FREM vs FRAX and FREM vs age alone were compared by the Hanley–McNeil method [26]. Calibration for 1-year MOF and HF prediction was assessed by comparing observed fracture risk with the risk predicted by FREM (H), FREM (P), and FREM (HP).
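To make the discrimination analysis concrete, the sketch below computes an AUC as a Mann-Whitney probability, approximates its standard error with the Hanley-McNeil formula, and forms a z statistic for two AUCs estimated in the same individuals. It assumes the standard published formulas and is not the SPSS procedure used in the study. The function names are hypothetical, and the correlation `r` between the two AUC estimates must be supplied by the user.

```python
import math


def auc(scores_pos, scores_neg):
    """Probability that a case scores higher than a non-case (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def hanley_mcneil_se(a, n_pos, n_neg):
    """Approximate standard error of an AUC `a` (Hanley-McNeil formula)."""
    q1 = a / (2.0 - a)
    q2 = 2.0 * a * a / (1.0 + a)
    variance = (a * (1.0 - a)
                + (n_pos - 1) * (q1 - a * a)
                + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)
    return math.sqrt(variance)


def compare_correlated_aucs(a1, se1, a2, se2, r):
    """z statistic for two AUCs from the same sample; r is their correlation."""
    return (a1 - a2) / math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * r * se1 * se2)


# Cohort-level illustration: 902 individuals with a 1-year MOF among 74,828.
se_mof = hanley_mcneil_se(0.652, 902, 74828 - 902)
```

For the overall 1-year MOF result (AUC 0.652), this approximation gives a standard error of about 0.01, which is broadly in line with the width of the reported confidence interval.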

Statistical analyses were performed with Statistica (version 13.0, StatSoft Inc., Tulsa, OK) and SPSS for Windows (version 24) for the ROC-AUC analyses.

Results

We included 74,828 individuals, corresponding to 67,930 (91%) women and 6,898 (9%) men with a mean age of 65.3 years and a mean period of prior registry coverage of 33.1 years (Table 1).

Table 1 Cohort characteristics at baseline

The mean calculated FREM (HP) score, corresponding to the predicted absolute fracture risk, for 1-year MOF was 1.4% for women and significantly lower at 1.1% for men (P < 0.001). Furthermore, the mean FREM (HP) score for 1-year HF was 0.5% for women and 0.8% for men (P < 0.001). FREM scores determined only from hospital records (FREM (H)) or only from physician claims (FREM (P)) were slightly lower than those determined using both sources (FREM (HP)) (Table 1).

In total, 902 (1.2%) individuals experienced a MOF during 1-year follow-up, and 1612 (2.2%) experienced a MOF during 2-year follow-up, while 136 (0.2%) experienced a HF during 1-year follow-up and 299 (0.4%) individuals during 2-year follow-up. A higher proportion of men than women experienced a MOF, and MOF as well as HF was more common with increasing age (Table 2).

Table 2 Observed fracture outcomes

For stratification of MOF risk, FREM (HP) gave an overall AUC of 0.652 (95% CI 0.633; 0.671) at 1 year and 0.659 (95% CI 0.645; 0.673) at 2 years, with slightly lower values for FREM (H) and FREM (P). The AUC for FREM (HP) was significantly higher than for age alone but significantly lower than for FRAX (P < 0.001 for both 1-year and 2-year risk). AUCs were generally higher for women than for men (Table 3 and Supplementary Figs. 1 and 2). The AUC for FREM (HP) was significantly higher than for age alone in both sexes and for both 1-year and 2-year risk (P < 0.001) and significantly lower than the AUC for FRAX in women (P < 0.001 for both 1-year and 2-year risk) but not in men (P = 0.957 for 1-year risk and P = 0.814 for 2-year risk).

Table 3 Area under the curve (AUC) for incident fracture stratification

The overall AUC for HF prediction from FREM (HP) was 0.826 (95% CI 0.795; 0.857) for 1-year risk and 0.817 (95% CI 0.794; 0.840) for 2-year risk, which was again slightly higher than for FREM (H) and FREM (P). The AUC for FREM (HP) was significantly higher than for age (2-year risk) but significantly lower than for FRAX (P < 0.001 for both 1-year and 2-year risk). AUCs were higher for women than for men (Table 3 and Supplementary Figs. 3 and 4). Again, the AUC for FRAX was higher than for FREM (HP) for both 1-year risk (P < 0.001 for women and P = 0.034 for men) and 2-year risk (P < 0.001 for women and P = 0.005 for men). On the other hand, FREM (HP) for HF did not significantly outperform age alone for 1-year risk (P = 0.143 for women and P = 0.938 for men), whereas a significant difference was seen for 2-year risk in women (P < 0.001) but not in men (P = 0.100). Age-specific AUC results are presented in Table 4. Specifically, we note that the AUC values obtained in individuals aged 80+ are relatively low, indicating that FREM underperforms in the oldest individuals, in whom prediction by age alone shows a similar performance.

Table 4 Validation of predictive models stratified by sex and age group

Investigating the calibration of FREM, we observed that the MOF risk predicted by FREM was generally higher than the observed fracture rate, especially in the oldest age group (85+ years), where 2.80% of the women and 2.53% of the men experienced a fracture during the first year, but FREM (HP) predicted fracture rates of 3.88% in women and 2.99% in men. A similar trend of predicted risk above the observed risk was seen for HF, with predicted risks of 2–3% in the oldest compared to observed risks of 0.5–1% (Table 5 and Fig. 1).
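One simple way to summarize such a comparison is the ratio of predicted to observed risk, where a value above 1 indicates overestimation. The short Python sketch below applies this to the 1-year MOF figures quoted above for the oldest age group. The ratio is our own illustrative summary, not a statistic reported in the study, and the function name is hypothetical.

```python
def calibration_ratio(predicted_pct, observed_pct):
    """Ratio of predicted to observed risk; values above 1 mean overestimation."""
    return predicted_pct / observed_pct


# 1-year MOF risk in the oldest age group: (predicted by FREM (HP), observed), in %
oldest = {"women": (3.88, 2.80), "men": (2.99, 2.53)}

ratios = {sex: round(calibration_ratio(predicted, observed), 2)
          for sex, (predicted, observed) in oldest.items()}
```

With these figures the ratio is about 1.39 for women and 1.18 for men, so the predicted MOF risk exceeds the observed risk by roughly 39% and 18%, respectively.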

Table 5 Calibration of the prediction models, comparing predicted and observed 1-year fracture risk
Fig. 1 Calibration of FREM for 1-year MOF risk (left) and 1-year hip fracture risk (right), stratified by sex and age group. Bars show predicted versus observed fracture risk

Discussion

In this study, we have applied the FREM algorithm to a large registry-based cohort of individuals from Manitoba, Canada, with the aim of validating its performance in a population that is independent, and in some aspects quite distinct, from the Danish population in which FREM had been developed and validated [12, 13].

We observed that FREM (HP) performed well for HF in women with excellent AUC values for both 1- and 2-year risk prediction yet performed less well for HF in men with moderate AUC values. The predictive values for MOF were also somewhat lower than for HF in women, with moderate AUC values in both sexes. The performance was better if applying both hospital diagnosis codes and physician claims data in the model. FREM tended to overestimate absolute fracture risk in this population, as indicated by the higher predicted than observed fracture risk, especially for hip fractures.

In terms of discrimination, FREM generally outperformed age for MOF and for 2-year HF risk prediction in women. FREM was found to be less accurate than FRAX including BMD (except in the prediction of MOF in men, where discrimination was comparable between FRAX and FREM). This was as expected, due to the inclusion of BMD in FRAX but not in FREM. It is worth noting in this context that the intention of FREM is to identify individuals on a population level who are at high imminent risk of fractures and should undergo further risk evaluation, rather than to provide a final risk estimate on which to base treatment decisions. Our results could indicate that FREM may be a useful tool for such first-line screening.

Comparing the validity of FREM in Canada with that observed in Denmark [12, 13], we found similar levels of predictive power, indicating good generalizability of the FREM algorithm. On the other hand, in the current study we observed larger differences in the calibration of the algorithm, with much lower rates of osteoporotic fractures in the cohort than predicted by the algorithm. This indicates that predictions of absolute risk, as opposed to merely identifying high-risk individuals, might require a population- and setting-specific calibration of the FREM algorithm. One possible contributing factor in the calibration difference between this study and the original FREM development is the inclusion of only DXA-scanned individuals in the current cohort, compared to the full population in the Danish cohort, although additional differences caused by differences in population risk between Manitoba and Denmark cannot be excluded. In addition, it is conceivable that the threshold for hospital contact, even on an outpatient basis, marks a somewhat more severe case than a patient seen under the same diagnosis in general practice. This, combined with the known high osteoporotic fracture risk in the Danish population [27], likely explains why the risk prediction needs to be calibrated somewhat lower for use in Canada.
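As a hedged illustration of what a population-specific calibration could look like, the sketch below performs a simple "recalibration in the large": it searches for a constant shift on the log-odds scale so that the mean predicted risk matches the observed fracture rate. This is a generic technique shown for illustration only, not a method applied in this study; the function names and the example risks are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate_in_the_large(predicted, observed_rate, tol=1e-10):
    """Shift all predictions by one constant on the log-odds scale.

    Uses bisection, because the mean shifted risk increases with the shift.
    Returns the shift and the recalibrated risks.
    """
    logits = [logit(p) for p in predicted]
    low, high = -20.0, 20.0
    while high - low > tol:
        mid = (low + high) / 2.0
        mean_risk = sum(expit(x + mid) for x in logits) / len(logits)
        if mean_risk > observed_rate:
            high = mid
        else:
            low = mid
    shift = (low + high) / 2.0
    return shift, [expit(x + shift) for x in logits]


# Hypothetical predicted risks with a mean of 4.5% and an observed rate of 3%.
shift, adjusted = recalibrate_in_the_large([0.01, 0.02, 0.05, 0.10], 0.03)
```

Because the shift is a single constant, the ranking of individuals, and therefore the AUC, is unchanged; only the absolute risk level moves.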

In recent years, several studies regarding tools for imminent fracture risk assessment have been published. The studies differ in their eligibility criteria and settings, potentially resulting in rather heterogeneous fracture risk profiles, and base their model fully or partially on risk factors with a biologically plausible link to fractures [28,29,30]. A conceptually different tool was developed by Almog et al. [31], who used a US administrative health database and applied deep learning techniques from natural language processing to patient medical history to develop an algorithm to identify individuals at risk of fracture within a 2-year period. In comparison, FREM included the entire Danish population aged 45 years or older and is based on covariates identified in the analytical process to have a predictive value for fractures, irrespective of whether these fit into the pathophysiological process of fracture development [12]. This difference in the covariates applied may give FREM, as well as the algorithm developed by Almog et al. [31], a lower face validity than tools based on established clinical risk factors for osteoporotic fractures.

The AUCs of these published tools range from 0.71 to 0.77 for HF and 0.62 to 0.73 for composite fracture outcomes, derived from development or internal validation cohorts [28,29,30]. Almog et al. [31] demonstrated AUCs of 0.79–0.82 for any fracture, depending on the model framework applied, and higher values if focusing on subsequent fractures only. While direct comparison of FREM against these tools is not appropriate, among other reasons because this is an external validation of FREM as compared to the internal validation of the abovementioned tools [32], the results do seem to indicate that FREM performs better than or on par with these tools with regard to discriminative power for the prediction of hip fractures in women. A lower AUC for composite fracture outcomes is similarly noted for FREM, as well as FRAX, in this study and in prior FREM studies [12, 13], and has been suggested to be due to an increased variability in predictors across the different fracture types included in such outcomes [29].

Beyond this external validation of FREM, we are aware of only one other imminent fracture risk assessment tool having been validated externally, though these data have yet to be published in full. The available data indicate that the IFRISK tool achieves an AUC for HF of 0.78 in the internal validation, while in two external validations, it was 0.70–0.71 [33].

Uncertainties remain about the appropriate time horizon for fracture risk assessments, and it has been argued that a 10-year time frame is most relevant [34]. Nevertheless, the appeal of imminent fracture risk assessment tools lies in their potential to improve prevention of osteoporotic fractures. Indeed, evidence-based pharmacological and non-pharmacological means to manage an increased fracture risk are available [35], and there is an increasing call for action in this setting [36,37,38]. Despite this, fragility fracture prevention remains insufficient [38,39,40], indicating a need for other ways of identifying individuals at risk. The FREM tool holds the potential for automated screening, as it only includes diagnoses, which are often available in electronic databases and patient files, and we believe that the incorporation of this or a similar tool may prompt a more proactive approach to fracture risk management among physicians. However, this has yet to be evaluated in clinical trials. As with all risk assessment tools, perhaps in particular those leveraging administrative health data, there is a risk that they may be used by healthcare insurers to screen (potential) subscribers, and this risk should be mitigated through legislative efforts.

The main strength of this study is the validation on a large cohort distinct from the original setting in which FREM was developed. Moreover, because the Manitoba database includes diagnoses from hospitals as well as from general practitioners, we could investigate the dependence of FREM on data source, thereby increasing the generalizability of the algorithm to settings with different regimes of administrative data collection. Furthermore, the availability of BMD measurements for the included participants enabled for the first time a direct comparison between FREM and the much-used FRAX algorithm, enabling us to quantify the degree to which administrative information can approximate the information content of clinical measurements.

Acknowledged limitations to this study include the reliance on a clinical BMD registry, though this captures all clinical BMD testing in the province of Manitoba and therefore reflects a broad range of individuals typically referred for fracture risk assessment. Women referred for testing have mean BMD measurements that are average for age with prevalence of FRAX risk factors that are also similar to the general Canadian population; as expected, men typically show greater referral bias [41, 42]. Another limitation is the use of a 10-year lookback in this study, whereas FREM was originally developed using a 15-year lookback [12]. However, as FREM was subsequently validated in a Danish cohort using a 5-year lookback [13], we do not expect this to be a source of error in the fracture risk prediction in this study.

Moreover, differences between the Canadian sample included in this study, and the Danish sample used to develop FREM, are both a weakness and a strength of the current study. There are differences with respect to ethnic composition, Manitoba being much more ethnically heterogeneous than Denmark with a large proportion of immigrants, which might contribute to the need for recalibration. On the other hand, significant stratification with FREM, in spite of such differences, indicates that FREM might show acceptable performance in a wide range of populations after recalibration.

Future research on FREM should explore the additional predictive value obtained by including registry data concerning the use of medications and potentially biochemistry results obtained as part of daily clinical practice. Furthermore, as has been demonstrated in this study, there is a need to calibrate the tool for use under local conditions, as has also been done for FRAX. Moreover, taking into account the recency of previous fractures as well as of other risk factors might further improve FREM, similar to what has been observed for FRAX [43]. Hence, a future development of an extended version of FREM taking recency as well as frequency of predictors into account is desirable. Additionally, studies investigating the cost-effectiveness of implementing FREM in screening programs would be desirable.

In conclusion, this study documents that FREM has good validity for predicting the risk of osteoporotic fractures, and particularly hip fractures, also in populations different from the Danish population in which the algorithm was developed. However, population-specific calibration is needed if the aim is absolute risk prediction rather than risk stratification alone. FREM performs better than age alone but weaker than FRAX including BMD measurements, and inclusion of physician claims enhanced the performance of FREM.