Introduction

Osteoporosis risk assessment tools have been developed to objectively select postmenopausal women who could benefit from central (hip and lumbar spine) bone mineral density testing. The best-validated tools are the Osteoporosis Self-assessment Tool (OST) developed in an Asian study population [1], the Osteoporosis Risk Assessment Instrument (ORAI) from a population-based Canadian cohort [2], and the Simple Calculated Osteoporosis Risk Estimation (SCORE) from a study population recruited from US academic and community-based medical centers [3]. Despite multiple validation and comparative studies in postmenopausal women [4–11], these tools have yet to be used in clinical practice in the US. The main methodological barriers to clinical use have been lack of validation in a large, population-based US database and varying performance of the original cut points among different study populations.

We evaluated the diagnostic accuracy of the OST, ORAI and SCORE to detect low bone density in white women aged 67 years and older from the Study of Osteoporotic Fractures (SOF) cohort. Our objective was to test whether the tools could identify low bone density accurately enough to be useful for clinical decision-making in elderly US white women.

Methods

Study population

The SOF inception cohort included 9704 ambulatory white women aged 65 years and older recruited between 1986 and 1988 from population-based listings at four US sites: Baltimore, Maryland; Minneapolis, Minnesota; the Monongahela Valley near Pittsburgh, Pennsylvania; and Portland, Oregon [12]. Women with bilateral hip replacements were excluded. All participants provided informed consent, and the study was approved by the appropriate institutional review committees of all participating sites. The age range of the SOF cohort was appropriate for testing of the osteoporosis risk assessment tools, since the development cohorts of the tools included women aged 45 to 80+ years.

We conducted a secondary analysis of the SOF Online public database [http://sof.ucsf.edu/public/] that included 7779 SOF participants with technically adequate bone mineral density measurements and a complete set of variables to calculate the risk scores at the second follow-up visit (January 1989 to December 1990; this was the earliest visit at which central [hip and lumbar spine] bone density testing was performed). The number of participants with a complete set of variables differed for each risk tool, e.g., N = 7617 for OST, N = 7679 for ORAI, N = 7235 for SCORE. (Note: these N values are from our analysis of the SOF Online database. An investigator [LL] at the SOF Coordinating Center repeated the N calculations in the complete SOF database, which includes confidential extreme values for continuous variables that are not available online. This only yielded about 170 additional eligible participants without significant differences in the ROC curve analysis results [results available upon request]. Thus, we conducted all analyses on the SOF Online data only.) The secondary analysis protocol was reviewed and approved by the Institutional Review Board of the University of North Carolina.

Variables

Bone mineral density (BMD) of the femoral neck and lumbar spine was measured using dual energy X-ray absorptiometry (DXA, Hologic, Waltham, MA). T-scores ([BMD of participant - mean BMD of reference population]/SD of BMD of reference population) are the basis for the World Health Organization diagnostic criteria for osteoporosis [13]. Femoral neck T-scores were calculated using NHANES III bone density norms for non-Hispanic white women aged 20–29 years [14]. Lumbar spine T-scores were calculated using Hologic densitometer manufacturer norms for women aged 30 years [15].

The following were the published reference variables for the tools: femoral neck T-score ≤−2.5 for the OST, femoral neck or lumbar spine T-score ≤−2.0 for the ORAI, and femoral neck T-score ≤−2.0 for the SCORE. We also tested the World Health Organization (WHO) diagnostic criteria for osteoporosis (femoral neck or lumbar spine T-score ≤−2.5) as an alternative reference variable in the receiver operating characteristic (ROC) curve analysis.
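The T-score formula and the four reference definitions above can be sketched in a few lines of Python. This is an illustration only: the function names, dictionary keys, and the example T-scores are assumptions made for the sketch and are not taken from the SOF analysis code.

```python
def t_score(bmd, ref_mean, ref_sd):
    """T-score = (participant BMD - reference mean BMD) / reference SD."""
    if ref_sd <= 0:
        raise ValueError("reference SD must be positive")
    return (bmd - ref_mean) / ref_sd


# Reference (outcome) definitions as stated in the text:
# name -> (skeletal sites considered, T-score threshold).
REFERENCE_DEFINITIONS = {
    "OST": (("femoral_neck",), -2.5),
    "ORAI": (("femoral_neck", "lumbar_spine"), -2.0),
    "SCORE": (("femoral_neck",), -2.0),
    "WHO": (("femoral_neck", "lumbar_spine"), -2.5),
}


def has_low_bmd(t_scores, definition):
    """True if any site named by the definition is at or below its threshold."""
    sites, threshold = REFERENCE_DEFINITIONS[definition]
    return any(t_scores[site] <= threshold for site in sites)


# Hypothetical participant: these T-scores are invented for illustration.
example = {"femoral_neck": -1.8, "lumbar_spine": -2.6}
flags = {name: has_low_bmd(example, name) for name in REFERENCE_DEFINITIONS}
print(flags)
```

The same participant can be positive under one definition and negative under another, which is why each tool was evaluated against its own published reference variable as well as the WHO criteria.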

The classification variables were OST, ORAI and SCORE risk scores calculated using published formulas based on these clinical factors: age, weight, estrogen use, race/ethnicity, nontraumatic fracture after age 45, rheumatoid arthritis (Table 1) [1–3]. At least four different formulas have been used to calculate the OST [16]; we used the simplest formula [0.2 × (weight in kg − age) without truncation]. The OST was transformed to (-OST) for use in the receiver operating characteristic curve analysis. Age, weight, and prior fracture were also tested as classification variables, because they have been found to be among the strongest determinants of osteoporosis [17] and fractures [18] in published analyses.
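As a minimal sketch of the OST formula stated above, the following Python computes the untruncated score and its negation for ROC analysis. The function names and the two example women are hypothetical; the cut point of 1 used in the example is the adjusted OST threshold reported in the Results (low risk if OST > 1).

```python
def ost_score(weight_kg, age_years):
    """OST = 0.2 x (weight in kg - age in years), without truncation."""
    return 0.2 * (weight_kg - age_years)


def roc_input(ost):
    """Negate OST so that larger values indicate higher risk."""
    return -ost


def is_high_risk(ost, cut_point):
    """A woman is classified high risk when OST is at or below the cut point."""
    return ost <= cut_point


# Hypothetical examples, not SOF participants.
lighter_older = ost_score(weight_kg=55, age_years=75)    # about -4
heavier_younger = ost_score(weight_kg=90, age_years=70)  # about 4

print(is_high_risk(lighter_older, cut_point=1))    # True
print(is_high_risk(heavier_younger, cut_point=1))  # False
```

Because lower OST values indicate higher risk, negating the score lets it be treated like the ORAI and SCORE, where higher values indicate higher risk.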

Table 1 Osteoporosis risk assessment tools used in the analysis

Statistical analysis

We constructed separate receiver operating characteristic (ROC) curves and determined the area under the curve (AUC) and the corresponding 95% confidence intervals for each tool to identify the appropriate reference variables. Each point on the ROC curve marks the sensitivity and corresponding value of [1-specificity] for a given cut point on the risk tool. The AUC is an overall estimate of the accuracy of the risk score to identify patients with low bone density. It ranges from 0.5 for a test that performs no better than chance to 1 for a perfect test. The risk tools under study showed AUC values of 0.79 to 0.81 in the development cohorts and 0.79 in a comparative study (ORAI and SCORE only) [7].
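To make the ROC construction concrete, here is a small Python sketch on invented data. It is not the Stata procedure used in the analysis: it computes the AUC two ways, as the probability that a randomly chosen case outranks a randomly chosen non-case (ties counted as one half) and as the trapezoidal area under the empirical ROC points. All names and the toy scores are assumptions.

```python
def auc_pairwise(scores, labels):
    """AUC as the fraction of (case, non-case) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_points(scores, labels):
    """(1 - specificity, sensitivity) for each cut point 'score >= cut'."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= cut)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cut)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy risk scores (higher = higher risk) and low-BMD labels; invented data.
scores = [3, 2, 2, 1, 0, -1]
labels = [1, 1, 0, 1, 0, 0]

auc = auc_pairwise(scores, labels)
points = roc_points(scores, labels)
print(round(auc, 3), round(auc_trapezoid(points), 3))
```

The two calculations agree on the empirical curve, which is why the AUC can be read as the chance that a woman with low bone density receives a higher risk score than a woman without it.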

We conducted cut point testing to test risk score thresholds for each tool and individual risk factors (age, weight, prior fracture) in the SOF database. We first used the cut points from the development papers, which were set to achieve 90% sensitivity to detect the reference variables in the development cohorts. For age, a cut point of 65 years or older was reasonable because current evidence-based guidelines recommend routine screening in this age range of women [19]; however, because all of the women in our dataset were aged 67 years or older, the entire cohort was classified as “high-risk” by this criterion. For weight, we used the cut point of <127 pounds (<57.7 kg) from the National Osteoporosis Foundation clinical guidelines for osteoporosis management [20]. We calculated the sensitivity (true positive fraction), 1-specificity (false positive fraction) and corresponding binomial 95% confidence intervals using the exact method [21] for each risk tool applied to the SOF cohort. We then adjusted the cut points to achieve 90% sensitivity for each tool to identify the published reference variable, since the original cut points did not extrapolate to 90% sensitivity in the SOF cohort.
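The cut point adjustment described above can be sketched as a search for the strictest threshold that still reaches 90% sensitivity, followed by an exact binomial interval for the resulting proportion. This is a standard-library Python illustration on invented scores, assuming the "exact method" refers to the Clopper-Pearson interval; the source does not spell out the formula, and the function names are hypothetical.

```python
import math


def sensitivity_specificity(scores, labels, cut):
    """Call a woman 'high risk' when score >= cut; return (sens, spec)."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cut)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < cut)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cut)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cut)
    return tp / (tp + fn), tn / (tn + fp)


def cut_for_sensitivity(scores, labels, target=0.90):
    """Highest cut point whose sensitivity reaches the target."""
    for cut in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(scores, labels, cut)
        if sens >= target:
            return cut, sens, spec
    raise ValueError("no cut point reaches the target sensitivity")


def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    return sum(
        math.exp(math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                 + i * math.log(p) + (n - i) * math.log(1 - p))
        for i in range(k + 1)
    )


def _bisect_increasing(f, iterations=60):
    """Root of an increasing function on (0, 1) with f(0) < 0 < f(1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k successes in n trials."""
    lower = 0.0 if k == 0 else _bisect_increasing(
        lambda p: (1 - _binom_cdf(k - 1, n, p)) - alpha / 2)
    upper = 1.0 if k == n else _bisect_increasing(
        lambda p: alpha / 2 - _binom_cdf(k, n, p))
    return lower, upper


# Invented risk scores: 10 women with low BMD, 10 without.
pos_scores = [9, 8, 7, 6, 5, 5, 4, 4, 3, 1]
neg_scores = [6, 4, 3, 2, 2, 1, 1, 0, 0, 0]
scores = pos_scores + neg_scores
labels = [1] * 10 + [0] * 10

cut, sens, spec = cut_for_sensitivity(scores, labels)
ci_low, ci_high = exact_binomial_ci(k=9, n=10)  # 9 of 10 cases detected
print(cut, sens, spec, round(ci_low, 3), round(ci_high, 3))
```

Lowering the threshold always raises sensitivity at the expense of specificity, so the first cut point to reach the target in a descending search is the one that preserves the most specificity.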

We used likelihood ratios as clinically relevant measures to summarize diagnostic accuracy [22]. We focused on the likelihood ratio negative (LR-) because past analyses have shown the osteoporosis risk tools to be most useful for ruling out low bone density [4, 16]. New cut points were set to achieve an LR- of 0.1–0.2 for the low-risk group, since LR- values in this range generate moderate shifts in pre- to post-test probability to rule out patients without disease [23]. The corresponding likelihood ratio positive (LR+) values and large sample 95% confidence intervals [24] were calculated for each classification variable.
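The role of the likelihood ratios can be illustrated with a short Python sketch that turns a 2×2 table into LR+ and LR-, attaches a large-sample interval to LR-, and converts a pre-test probability into a post-test probability. The counts are hypothetical, and the log-based interval shown is one common large-sample form; the source cites [24] without giving the formula, so treating it as the method used is an assumption.

```python
import math


def likelihood_ratios(tp, fn, fp, tn):
    """Return (LR+, LR-) from the cells of a 2x2 diagnostic table."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return lr_pos, lr_neg


def lr_neg_ci(tp, fn, fp, tn, z=1.96):
    """Log-based large-sample interval for LR- (assumed form)."""
    _, lr_neg = likelihood_ratios(tp, fn, fp, tn)
    se_log = math.sqrt(1 / fn - 1 / (tp + fn) + 1 / tn - 1 / (tn + fp))
    return lr_neg * math.exp(-z * se_log), lr_neg * math.exp(z * se_log)


def post_test_probability(pre_test_probability, lr):
    """Apply a likelihood ratio on the odds scale."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Hypothetical counts: 100 women with low BMD, 400 without.
tp, fn, fp, tn = 95, 5, 300, 100
lr_pos, lr_neg = likelihood_ratios(tp, fn, fp, tn)
low, high = lr_neg_ci(tp, fn, fp, tn)
prevalence = (tp + fn) / (tp + fn + fp + tn)
post = post_test_probability(prevalence, lr_neg)
print(round(lr_pos, 2), round(lr_neg, 2), round(post, 3))
```

In this toy table an LR- of about 0.2 moves the probability of low bone density from 20% before testing to about 5% after a low-risk result, the kind of moderate shift that motivates the 0.1–0.2 target.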

For the OST tool, we constructed a flow diagram of test outcomes when the cut point for LR- 0.1–0.2 was applied. Since current guidelines encourage screening for women aged 65 and older, we retained a dichotomous cut point that would exclude fewer women than a strategy using multilevel likelihood ratios.

All analyses were performed using SAS 9.1 [25] and Stata SE 9.0 software [26], including the Stata diagnostic accuracy analysis programs BINSCRN1, EMROC and AUCBS [27].

Results

Of the 7779 SOF participants in the analytical cohort, 65.1% were aged 67 to 74 years, 29.8% weighed below 60 kg, and 42.7% were past or current users of hormone therapy (Table 2). Thirty-six percent reported a baseline history of some type of fracture since age 50 years.

Table 2 Risk variables and bone density measures in participants in the Study of Osteoporotic Fractures (n = 7779)

The overall discriminatory ability of the tools as measured by the AUC ranged from 0.60 for history of any fracture to 0.76 for the OST when the original reference variables were used (Table 3). The AUC for each tool was slightly lower when WHO diagnosis of osteoporosis (T-score ≤−2.5 at either femoral neck or lumbar spine) was used as the reference variable, except for the ORAI.

Table 3 Area under the receiver operating characteristic curves (AUC) for osteoporosis risk assessment tools to identify low bone mineral density (BMD) in women aged 67 years or older, n = 7779

LR- values ranged from 0.13 to 0.31 for the risk tools when cut points from the development papers were used, and from 0.72 to 0.96 for fracture-related variables (Table 4). The cut points required to achieve 90% sensitivity to identify low bone density were different from the cut points used in the development studies (Table 5). The cut points required to achieve LR- 0.1–0.2 (to rule out women without disease) were also different from the cut points used in the development studies, except for the SCORE (Table 6). The corresponding sensitivities (96% to 99%) of the tools were high, and specificities (6% to 22%) were low using the adjusted cut points. A weight <176 pounds (<80 kg) was required to achieve an LR- value of 0.1–0.2 to rule out women without low bone density.

Table 4 Diagnostic performance of osteoporosis risk tools to identify low BMD using published cut points (development studies)
Table 5 Diagnostic performance of osteoporosis risk tools using cut points to achieve 90% sensitivity to identify low BMD
Table 6 Diagnostic performance of osteoporosis risk tools using cut points to achieve LR- 0.1–0.2 to identify low BMD

Figure 1 shows a flow diagram of OST test outcomes when the cut point was set to create a low-risk category with LR- 0.1–0.2. Of women with a low-risk OST score (>1), 4.2% (57/1368) had femoral neck osteoporosis. Of women with a high-risk score (≤ 1), 24.1% (1505/6249) had osteoporosis. Considering all women without osteoporosis, 21.7% (1311/6055) had a low-risk score.
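The counts quoted for the OST flow diagram are enough to rebuild the full 2×2 table and check that they imply an LR- within the 0.1–0.2 target. The Python below uses only the numbers reported in the paragraph above; the variable names are illustrative, and the derived sensitivity, specificity and LR- are computed here rather than quoted from the tables.

```python
# Counts reported in the text for the OST cut point (low risk if OST > 1).
low_risk_total, low_risk_with_osteoporosis = 1368, 57
high_risk_total, high_risk_with_osteoporosis = 6249, 1505

# Cells of the 2x2 table, treating a high-risk score as a positive test.
tp = high_risk_with_osteoporosis
fn = low_risk_with_osteoporosis
fp = high_risk_total - tp
tn = low_risk_total - fn

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
lr_neg = (1 - sensitivity) / specificity

# Proportions quoted in the text, recomputed from the cells.
missed_in_low_risk = fn / low_risk_total          # about 4.2%
osteoporosis_in_high_risk = tp / high_risk_total  # about 24.1%
non_cases_ruled_out = tn / (tn + fp)              # about 21.7%

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"LR-={lr_neg:.2f}")
```

The reconstructed table sums to the 7617 women with complete OST data, and the proportion of non-cases ruled out equals the specificity, which is why a threshold chosen for a low LR- leaves most women without osteoporosis still classified as high risk.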

Fig. 1 Application of the Osteoporosis Self-assessment Tool (OST) to identify low BMD in the study sample of white women aged 67 years and older

Discussion

We conducted the largest US population-based validation study of osteoporosis risk assessment tools to date. Overall discriminatory ability was best for the OST, which is based only on age and weight. When cut points were set to achieve LR- 0.1–0.2 (to rule out women without disease), the tools excluded fewer than one-quarter of women without low bone density (e.g., only 21.7% of women without osteoporosis had a low-risk OST score). Weight alone was at least as accurate as two of the three risk tools (ORAI and SCORE), suggesting these risk score calculations offer minimal advantage at the point of care for women aged 67 years and older.

Similar to past comparative studies, we found the OST to have discriminatory ability equivalent to that of more complicated osteoporosis risk tools [4, 6, 10]. Unlike past studies, we found weight to have an area under the curve equivalent to or greater than the ORAI and SCORE, especially if T-score ≤−2.5 was used as the reference variable in the analysis (with femoral neck or lumbar spine T-score ≤−2.5 as outcome, AUCs were: weight 0.72, ORAI 0.70, SCORE 0.71). This finding may be due to the lower mean bone density in the SOF cohort compared to that in cohorts from other analyses. For example, Cadarette et al. [7] compared a weight criterion (<70 kg) to the ORAI, SCORE and other risk tools (not including OST) in 2365 Canadian women aged 45 years and older. The AUC for the weight classification variable increased by 0.11 when a lower femoral neck T-score was used as the reference variable (AUC 0.68 for weight vs. T-score <−1, AUC 0.79 for weight vs. T-score ≤−2.5). The corresponding AUCs for the ORAI and SCORE increased by only 0.08 (AUC 0.72 for ORAI vs. T-score <−1, AUC 0.80 for ORAI vs. T-score ≤−2.5; AUC 0.71 for SCORE vs. T-score <−1, AUC 0.79 for SCORE vs. T-score ≤−2.5). Of note, Dargent-Molina et al. found that weight was the strongest determinant of very low BMD (T-score ≤−3.5) in 6958 women aged 75 years and older participating in the EPIDOS study [17]. Thus, body weight might be especially useful in identifying women in the lower ranges of bone density that are more common with advancing age.

In its 2002 evidence report on osteoporosis screening, the US Preventive Services Task Force (USPSTF) reviewed ten cross-sectional studies of osteoporosis risk assessment, including the development studies for the ORAI and SCORE risk assessment tools [19, 28]. Methodological limitations of these studies were noted, especially lack of validation and lack of generalizability due to a small number of subjects or nonrepresentative subjects. The Task Force concluded that further validation of existing risk instruments or development of new ones would be useful. Results from our study have better generalizability for US patients than most previous validation studies because our sample was large and was drawn from a US population-based cohort study. However, our results only apply to women aged 67 years and older, not to the full range of ages tested in the risk tool development cohorts (45 to 80+ years). The third National Health and Nutrition Examination Survey (NHANES III, 1988–1991) found a 4% prevalence of osteoporosis in women aged 50–59 years and a 20% prevalence in women aged 60–69 years [29], so that about a 10% prevalence of osteoporosis might be expected in the 50-to-66-year age range. Future studies should test performance of the risk tools in younger postmenopausal women, since excluding low-risk individuals is especially important in population groups with lower prevalence of disease. Future studies should also explore the economic implications of clinical use of the osteoporosis risk tools. Although a secondary analysis supported a possible economic benefit [30], the uncertain generalizability of these tools and lack of standard cut points across different clinic populations (e.g., different ages of postmenopausal women, different ethnic groups) remain barriers to clinical use.

Since the Task Force report was published in 2002, secular change in medical policy has called into question the clinical role of the osteoporosis risk tools. Current research on risk assessment has focused on development and validation of the World Health Organization absolute fracture probability algorithm [31]. This algorithm determines 10-year absolute fracture risk for previously untreated patients aged 50 years and older based on the following risk factors: prior fracture, age, weight, family history, smoking, alcohol, corticosteroid use and other secondary causes of osteoporosis, and femoral neck bone mineral density [32]. Economic cost-effectiveness modeling will be used to determine reasonable treatment thresholds for different countries. Only patients with bone density T-scores ≤−2.5 are diagnosed with osteoporosis by current WHO criteria; in contrast, some patients with osteopenia plus other significant risk factors will have an absolute fracture risk above the treatment threshold according to the new algorithm. Since the WHO algorithm includes bone density measures, a standardized approach to selecting patients for bone density testing is still needed; however, such an approach will probably be dictated by upcoming policy changes. In October 2006, the Osteoporosis Work Group of the Physician Consortium for Performance Improvement reported central DXA screening for all women aged 65 and older as a Clinical Performance Measure [33]. Thus, one might anticipate little need for osteoporosis risk assessment tools in older women. However, pending legislative cuts are expected to decrease reimbursement for the performance of DXA bone density testing by over 70% by 2010 [34]. This reduction in reimbursement may lead to less testing, even in older women. These external factors are likely to influence physician decision-making much more than available risk tools. 
Finally, estrogen use has declined sharply since the Women’s Health Initiative investigators reported adverse outcomes from estrogen therapy in postmenopausal women in 2002 [35]. Thus, the estrogen variables in the ORAI and SCORE are less relevant for a greater number of women.

Our study had several limitations. Twenty percent (1925/9704) of the original SOF participants were excluded in this analysis due to drop-out between the baseline and second study visit, or technically inadequate or missing bone density measures or risk factor data. Results from the remaining sample may not be generalizable to all postmenopausal women in US primary care settings. We could only analyze women aged 67 and older in the SOF database; this limited our analysis of the ORAI tool, since an age of 67 years or older placed all participants in the high-risk category using the original cut points. As is true of most previous analyses of these tools, our sample comprised only white women; our results should not be extrapolated to nonwhite women, since the tools may have lower ability to detect low bone density in those groups [36]. Finally, the tools were designed to identify women with risk factors for osteoporosis who may be candidates for bone density testing; the tools do not identify osteoporosis accurately enough to be used as the sole basis (without bone density testing) for treatment decisions.

Conclusions

Our analysis does not support use of the osteoporosis risk assessment tools in US women aged 67 and older. The tools showed fair ability to rule out women without osteoporosis. However, weight alone showed comparable performance, indicating that these tools have little or no advantage over simple weight assessment in osteoporosis screening protocols. An accurate diagnosis of osteoporosis by current WHO criteria can only be made using direct measurement of bone mineral density. As financial limitations on screening increase, primary care physicians should direct screening efforts toward older patients with lower body weight to ensure that the highest-risk patients are selected for diagnostic bone density testing.