Introduction

Osteoporotic fractures are serious complications of osteoporosis. Identifying those at risk of osteoporotic fractures is an important preventive strategy that allows better evaluation and timely use of prophylactic treatments. Dual X-ray absorptiometry (DXA) is the gold standard for identifying osteoporotic postmenopausal women [1, 2]. Since measuring bone mineral density (BMD) in all postmenopausal women is not practical and DXA devices are not available in all regions of a country, prescreening tools should first be used to help identify high-risk patients. A number of risk assessment tools and models have been defined and used to evaluate postmenopausal women. These scoring models are used in primary care centers for referring patients to a secondary care center or a BMD center.

Three regions are used for assessing osteoporosis: the femoral neck, the lumbar spine and the total femur [3]. According to the World Health Organization, osteoporosis is defined as a BMD at least 2.5 standard deviations below the mean for young adults of the same race and sex (T-score ≤−2.5) [4]. For each standard deviation decrease in BMD, the risk of osteoporotic fracture increases 1.5- to threefold or even more. Appropriate cutoff values should be defined for effective prevention and treatment of fractures [5]. According to a large cohort study, only 6.4 % of osteoporotic fractures occurred at T-scores of −2.5 or less. This study also indicated that 82 % of postmenopausal women with fractures had T-scores above −2.5. The Study of Osteoporotic Fracture also showed that 54 % of hip fractures and 77 % of other osteoporotic fractures occurred at a T-score >−2.5 [6]. According to National Osteoporosis Foundation guidelines, pharmacological interventions should be started in women with T-scores of −2 or less [7, 8]. This cutoff point (a T-score of −2 or less, i.e., 2 or more standard deviations below the mean for young adults) is called the fracture threshold and is used for referring patients for bone densitometry.

Several risk factors are involved in developing osteoporosis, which ultimately leads to varying performance of prescreening models in different populations. To validate the use of a prescreening model in a study population, the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) should be calculated, along with receiver operating characteristic (ROC) curve analysis. These models have been further evaluated and calibrated in populations other than those in which they were developed; however, data are sparse, if available at all, for Iranian postmenopausal women.

In this study, we evaluated seven prescreening models in Iranian postmenopausal women in order to define the most appropriate model.

Methods

The present study was a cross-sectional study in which we analyzed data obtained from 8644 postmenopausal women who had undergone DXA at the BMD center of Shariati Hospital in Tehran, Iran. These postmenopausal women had been referred by their family physicians, endocrinologists and nephrologists between 2001 and 2011. The ratio of female to male participants, and of participants with secondary osteoporosis to those with primary osteoporosis, referred to this center was consistent throughout these years. Secondary osteoporosis, mainly due to prolonged corticosteroid use, had a prevalence of <10 % among the study members. The study was approved by the ethics committee of the Endocrinology and Metabolism Research Institute affiliated to Tehran University of Medical Sciences.

After giving informed consent, participants were asked to complete a questionnaire. This questionnaire included demographic data as well as risk factors for osteoporosis drawn from previous studies and previously suggested prescreening instruments. Bone densitometry measurements were taken using a DXA machine (Lunar, 7164, GE, Madison, WI) at the Shariati Hospital BMD center. The first outcome of interest was osteoporosis, defined as a BMD at least 2.5 standard deviations below the mean for young adults of the same race and sex (T-score ≤−2.5). The fracture threshold, at which prophylactic measures may be started, was also used. The fracture threshold is defined as a BMD at least 2 standard deviations below the mean for young adults (T-score ≤−2). BMD results were assessed in three regions: the femoral neck, the total femur and the lumbar area (L2-L4), selected based on previous studies [3, 9–11]. Osteoporosis was diagnosed when the T-score was ≤−2.5 in at least one of these regions. A T-score of ≤−2, the threshold at which women should be referred for DXA, was the second outcome of interest in this study.

Prescreening models

As mentioned above, the clinical performance of seven commonly used prescreening models was compared in this study using a large cohort of postmenopausal women. These models include: osteoporosis self-assessment tool (OST), osteoporosis risk assessment instrument (ORAI), simple calculated osteoporosis risk estimation (SCORE), age, body weight and no estrogen (ABONE), osteoporosis index of risk (OSIRIS), study of osteoporosis fracture–study utilizing risk factors (SOFSURF) and national osteoporosis foundation (NOF). The characteristics of these models are described below and shown in detail in Table 1.

Table 1 Characteristics of the seven prescreening models that have been developed, evaluated and used in different populations

The osteoporosis self-assessment tool (OST) was developed and validated in 1998. As one of the simple methods for identifying women at high risk of osteoporosis, it has been calibrated for Asian and Caucasian populations and for both men and women. The score is calculated by subtracting age (in years) from weight (in kg) and dividing the result by 5, and it was designed in women to predict a femoral neck T-score of ≤−2.5. Based on this model, women with a score of ≤2 were considered to have osteoporosis in the femoral neck area. OST was developed based on data from eight Asian countries [12, 13]. It was later validated for Caucasian European and US postmenopausal women [14]. A sensitivity of 91 %, a specificity of 45 % and an area under the ROC curve (AUC) of 0.79 were reported. Studies have reported sensitivities of 88–92 % in women and 82–85 % in men [15, 16].
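The OST calculation described above can be illustrated with a minimal sketch. The function names and the example patient are hypothetical, weight is assumed to be in kilograms and age in years, and rounding conventions, which vary between publications and are not specified here, are left out.

```python
def ost_score(weight_kg: float, age_years: float) -> float:
    """OST score as described in the text: (weight - age) / 5."""
    return (weight_kg - age_years) / 5.0


def ost_flags_high_risk(weight_kg: float, age_years: float,
                        cutoff: float = 2.0) -> bool:
    """True when the score is at or below the cutoff quoted in the text (<= 2)."""
    return ost_score(weight_kg, age_years) <= cutoff


# Hypothetical example: a 70-year-old woman weighing 55 kg.
print(ost_score(55, 70))            # (55 - 70) / 5
print(ost_flags_high_risk(55, 70))  # score is below the cutoff
```

A heavier and younger woman (for example 80 kg at age 50) scores 6 and would not be flagged under the same cutoff.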

The osteoporosis risk assessment instrument (ORAI) was developed and validated in 2000 in 1376 women from the Canadian Multicentre Osteoporosis Study (CaMos). Age, weight and use of hormone replacement therapy (HRT) were the predictors in this model. Low BMD at the femoral neck or the lumbar spine was the outcome of this study. Based on ORAI, women with a score of ≥9 should be referred for further BMD study. This study also indicated that ORAI has a sensitivity of 94 % and a specificity of 41 % [17].

The simple calculated osteoporosis risk estimation (SCORE) was developed in 1996 in a study of 1279 American women. Age, weight and use of HRT were predictors in this model. Rheumatoid arthritis, race and personal history of fracture were also included. Osteoporosis was defined using a T-score of ≤−2. The model had a sensitivity of 93 % and a specificity of 43 % using a score of ≥6 for osteoporosis [18].

The ABONE model, using age, body weight and no estrogen as predictors, was developed in 1610 white women who had a DXA scan for the first time. This model had an AUC of 0.72 when validated on 2365 women from the CaMos database [19, 20].

The osteoporosis index of risk (OSIRIS) was developed using a retrospective database of 1303 postmenopausal women in 1999. Age, body weight, current HRT and history of previous low-impact fracture were the predictor variables. The AUC was 0.71. An OSIRIS score of ≤1 had a sensitivity of 85 % and a specificity of 39 % [21].

Another prescreening tool evaluated in this study is the study of osteoporosis fracture–study utilizing risk factors (SOFSURF), which has only been published as an abstract. Age, weight, smoking status and previous postmenopausal fracture were identified as predictors of low bone mineral density. Osteoporosis was defined as a total hip BMD T-score of ≤−2.5. Based on this model, individuals with a score of ≥3 should be referred for DXA. The sensitivity and specificity of this model were not reported; however, it had an AUC of 0.75 [22].

Cadarette et al. defined the National Osteoporosis Foundation (NOF) score in 1999, using age, body weight, personal history of fracture after minimal trauma, family history of fracture and cigarette smoking, to incorporate the screening guidelines recommended by the NOF. The score was validated on the CaMos database. Based on this model, a woman with an NOF score of ≥1 should be referred for further imaging [19, 23].
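The referral cutoffs quoted in the model descriptions above can be gathered into a single lookup, sketched below. The dictionary and function names are hypothetical, the score itself is assumed to have been computed elsewhere, and ABONE is omitted because its cutoff is not stated in this section.

```python
import operator

# Cutoffs as quoted in the text. The direction differs between models:
# low OST and OSIRIS scores indicate risk, high scores do for the others.
REFERRAL_RULES = {
    "OST": (operator.le, 2),
    "ORAI": (operator.ge, 9),
    "SCORE": (operator.ge, 6),
    "OSIRIS": (operator.le, 1),
    "SOFSURF": (operator.ge, 3),
    "NOF": (operator.ge, 1),
}


def should_refer(model: str, score: float) -> bool:
    """Return True when the score reaches the model's quoted cutoff."""
    compare, cutoff = REFERRAL_RULES[model]
    return compare(score, cutoff)


# Hypothetical scores for one patient.
for model, score in [("OST", -3), ("ORAI", 8), ("SCORE", 7)]:
    print(model, score, should_refer(model, score))
```

Keeping the comparison direction next to each cutoff avoids the easy mistake of applying a "greater than or equal" rule to a model in which lower scores mean higher risk.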

These seven models are used to determine whether a patient has osteoporosis or not. Another tool, the Fracture Risk Assessment Tool (FRAX), was developed by World Health Organization (WHO) task forces in 2008 to predict the 10-year risk of fracture. FRAX integrates clinical risk factors with BMD at the femoral neck for this purpose and is derived from patient populations in North America, Latin America, Europe, Asia and Australia. Because its outcome of interest differs, evaluating the performance of this model in the Iranian population was not an objective of this study [24, 25].

Statistical analysis

Data were described and classified based on different cutoff points (−1, −2 and −2.5) for BMD T-scores in the three regions. Predictors of the different models were classified, and the scoring system of each model was formulated. Sensitivity, specificity, PPV, NPV and AUC were then calculated for each model. Data were analyzed using STATA 11.1.
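As an illustration of the indices listed above, the following sketch computes them from screening results and DXA outcomes. It is not the STATA code used in the study; the function names and the toy data are hypothetical, higher scores are assumed to mean higher risk (scores such as OST would be negated first), and each class and each test result is assumed to occur at least once so that no denominator is zero.

```python
def diagnostic_indices(predicted, actual):
    """Sensitivity, specificity, PPV and NPV from paired boolean sequences."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(scores, actual):
    """Rank-based AUC: the probability that a randomly chosen affected
    woman has a higher score than a randomly chosen unaffected woman,
    with ties counted as one half."""
    pos = [s for s, a in zip(scores, actual) if a]
    neg = [s for s, a in zip(scores, actual) if not a]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: eight women, three with osteoporosis on DXA.
actual = [True, True, True, False, False, False, False, False]
scores = [9, 7, 3, 8, 4, 2, 1, 1]
predicted = [s >= 5 for s in scores]  # hypothetical referral cutoff

print(diagnostic_indices(predicted, actual))
print(auc(scores, actual))
```

The pairwise AUC is quadratic in the number of participants, which is acceptable for a sketch; a rank-sum formulation would be preferable for a sample of several thousand women.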

Results

Participants in this study were postmenopausal women between 23 and 94 years of age (mean age 59 ± 8.7 years). Patients were 18–69 years old at menopause (mean 47 ± 6.4). Body mass index was between 14 and 55.24 (mean 27.91 ± 0.06). The number and percentage of patients with and without osteoporosis are shown in Table 2. The three regions of interest were first analyzed separately. Table 3 presents the performance of the models based on a BMD T-score of ≤−2.5 (osteoporosis) in the three regions of interest. For a better assessment of these models and their practical performance, the prescreening tools should also be tested using the fracture (referral) threshold (a BMD T-score of −2 or less, i.e., 2 or more standard deviations below the mean for young adults). These results are shown in detail in Table 4. Tables 5 and 6 present the performance of the models for the three regions of interest together, at the osteoporosis threshold and the fracture threshold, respectively. The interpretation of these results is discussed below.

Table 2 Participants are classified based on different important T-scores in femoral neck, total femur and lumbar areas
Table 3 Different models are evaluated in femoral neck, total femur and the lumbar areas at a BMD T-score cutoff of −2.5
Table 4 Different models are evaluated in femoral neck, total femur and the lumbar areas at a BMD T-score cutoff of −2
Table 5 Performance of models at BMD cutoff score of −2.5 in three regions (the femoral neck, total femur and lumbar area)
Table 6 Performance of models at BMD cutoff score of −2 in three regions (the femoral neck, total femur and lumbar areas)

Discussion

Since the mid-1990s, a number of studies have been conducted to develop models that could help identify postmenopausal women at risk for osteoporosis or low bone mass who would then be referred for a DXA scan for confirmation. These prescreening tools were later evaluated and compared in different populations to select the most appropriate model. In this study, we applied seven prescreening models to a large sample of Iranian postmenopausal women to evaluate their performance and choose the most appropriate tool for that population.

Applying these tools in different populations has yielded different performance results. Some studies showed high sensitivities and good performance, while others reported lower indices than those found in earlier studies. In 2001, Cadarette et al. evaluated ABONE, ORAI, weight criterion (body weight <70 kg), SCORE and NOF models. The CaMos database (Canadian women) was used for this study, and these prescreening tools were applied to the data from 2365 postmenopausal women. Based on their evaluation, SCORE and ORAI were more appropriate (with sensitivities of 99 and 97 %, respectively) for detecting osteoporosis [19]. One study evaluated the use of ABONE, ORAI, OST, SCORE and weight criterion in 174 African-American women and reported lower sensitivities than previously reported, ranging from 65 % for ORAI to 83 % for SCORE [1]. An evaluation of ORAI, OST, OSIRIS and weight criterion in 665 Spanish postmenopausal women showed that these tools would correctly identify only 45, 46, 37 and 70 % of women who would benefit from a DXA scan, respectively. Calculated sensitivities were 58, 64, 69 and 83 % for OSIRIS, ORAI, OST and weight, respectively [2]. In one study, ORAI and SCORE were compared among three different ethnicities: Hispanic, non-Hispanic white and African-American women. Overall sensitivities of 66 and 68 % were reported for SCORE and ORAI, respectively. Moreover, SCORE identified only 30 % of women who should have been referred for BMD [3]. A previous study of 5573 Iranian postmenopausal women showed a sensitivity of 71 % for OST and 74 % for both ORAI and weight criterion [9]. Another study of 341 Iranian postmenopausal women reported a sensitivity of only 48 % for NOF but 70 and 87 % for ORAI and SCORE, respectively [26].

Before interpreting the results of different studies and comparing them with our results, some points must be noted. First, in the studies mentioned above, sensitivity was treated as the sole index of interest for evaluating the performance of prescreening models. Although sensitivity is an important index, clinicians do not approach their patients merely based on this index; rather, they would like to know the probability of disease in an individual with a positive test result, i.e., the positive predictive value, which itself depends on prevalence. Another measure of a model's strength is its diagnostic accuracy, a concept interpreted based on ROC analysis and AUC. A model with an AUC of <0.5 identifies patients less reliably than flipping a coin and is thus considered useless. A good screening tool should have an AUC of more than 0.7. Before models are compared on their sensitivities and specificities in different settings, AUCs should be reported and only models with moderate-to-high accuracy (AUC of more than 0.7) should be considered [27, 28]. In fact, in 2010, the US Preventive Services Task Force reviewed the performance of 14 prescreening instruments and indicated that they are only modest predictors of osteoporosis. The AUC of these models ranged from as low as 0.13 to as high as 0.87 [29].
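The dependence of PPV on prevalence noted above follows from Bayes' theorem and can be sketched as follows. The function name is hypothetical, the sensitivity and specificity are the values reported for OST earlier in this article (91 and 45 %), and the prevalences are illustrative assumptions rather than study results.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same test characteristics, three hypothetical prevalences.
for prevalence in (0.10, 0.30, 0.50):
    print(f"prevalence {prevalence:.2f}: PPV {ppv(0.91, 0.45, prevalence):.2f}")
```

With sensitivity and specificity held fixed, the PPV rises as the condition becomes more common, which is why PPVs obtained in a referral center cannot be transferred directly to a primary care population.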

Another important point for interpreting the results of different studies is the exact definition of osteoporosis, both in studies that evaluate the performance of these tools and in the studies that developed them. For example, ORAI was developed to correctly classify women with a T-score ≤−2.0 in the femoral neck or lumbar spine area, while OSTA considered a femoral neck T-score ≤−2.5 as the objective of the study. While a T-score ≤−2.5 is the threshold for defining osteoporosis based on WHO [5], some tools prefer to identify at-risk patients sooner and consider a T-score of −2 as the appropriate cutoff for a prescreening model. The National Osteoporosis Risk Assessment study in 149529 white postmenopausal women reported that 82 % of women with fractures had T-scores above −2.5. Also, 67 % of them had a T-score >−2 [6]. Thus, the −2 threshold provides a safer and more appropriate boundary for prophylactic strategies and helps assign screening intervals. ORAI, SCORE, SOFSURF and NOF have used −2 as the appropriate cutoff for detecting osteoporosis. Another important criterion is the skeletal site of osteoporosis. Studies have shown that T-scores of the lumbar spine and total hip might be discordant in at least one diagnostic class in about 40 % of patients and that the BMD of one anatomical location cannot predict the density at other sites [30, 31]. The International Society for Clinical Densitometry (ISCD) emphasizes that the WHO classification should be considered in three regions in postmenopausal women: the lumbar spine, total hip and femoral neck. A T-score below the determined cutoff in any one of these sites is sufficient for the diagnosis [32]. This definition was only used in ABONE and OSIRIS, with a T-score <−2.5 as the cutoff.
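The any-site rule described above amounts to comparing the lowest of the three T-scores with the chosen cutoff. A minimal sketch, with a hypothetical function name and hypothetical T-scores, is given below; the cutoff is treated as inclusive, as in the Methods.

```python
def meets_threshold(t_scores: dict, cutoff: float = -2.5) -> bool:
    """True when the T-score at any measured site is at or below the cutoff."""
    return min(t_scores.values()) <= cutoff


# Hypothetical patient with discordant sites.
patient = {"femoral_neck": -2.2, "total_hip": -1.0, "lumbar_spine": -1.5}

print(meets_threshold(patient, cutoff=-2.5))  # osteoporosis threshold
print(meets_threshold(patient, cutoff=-2.0))  # fracture (referral) threshold
```

This patient is not classified as osteoporotic at −2.5 but reaches the fracture threshold at −2 because of the femoral neck value alone, which illustrates how both the site definition and the cutoff change who counts as a case.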

With regard to the points above, when −2.5 is taken as the T-score cutoff in the three sites of interest, none of these models fulfills the criteria for an accurate model (AUC is between 0.6 and 0.7 for all of them, as shown in Table 5). The least appropriate model based on AUC is ABONE, correctly classifying only 56 % of patients, and the least sensitive model is SOFSURF, with a sensitivity of only 44 %. OSIRIS, OSTA and ORAI are considered more appropriate models, with an AUC of approximately 0.66. Although OSIRIS correctly classifies 70 % of patients, OSTA and ORAI are considered more suitable because of their higher sensitivity (76 % in OSTA and ORAI vs. 58 % in OSIRIS). When the data are analyzed in depth for different skeletal sites, OSIRIS (AUC ≥ 0.7) is considered a good model in the total femur and femoral neck areas and has a sensitivity of about 70 % in these regions. For predicting osteoporosis at the lumbar spine, OSTA has the highest AUC (0.65) and the highest sensitivity (75 %). Although OSTA, ORAI and SCORE are not good models regarding their AUCs, their sensitivities for the three regions are 75–90 % (shown in Table 3).

When a T-score of −2 or less in the three skeletal sites is considered, again none of the models has an AUC value >0.7; however, OSTA, ORAI, SCORE and OSIRIS have values of around 0.65. When other indices are taken into account, OSTA and ORAI are considered more appropriate, and SOFSURF, with an AUC of 0.62 and a sensitivity of only 37 %, is the least appropriate tool (Table 6). With regard to different sites of osteoporosis, OSIRIS has an AUC of 0.7 in the total femur and femoral neck, with sensitivities of 67 and 63 %, respectively. However, OSTA, ORAI and SCORE, with slightly lower AUC values, have sensitivities of about 82 % and thus are considered more appropriate. For the lumbar spine, all AUCs are <0.65 and again none of the models is appropriate. OSTA, ABONE and ORAI, with AUC values of around 0.63, have a sensitivity of 69 %. In the lumbar spine, SOFSURF has the lowest sensitivity and AUC (Table 4).

From a clinician’s point of view, when PPV is considered at a T-score of −2.5, SOFSURF and OSIRIS have higher PPVs (and specificities) but are not appropriate for screening because of their low sensitivity. OSTA, followed closely by ORAI and SCORE, should be considered more appropriate, with quite acceptable PPVs of around 47 % and sensitivities of around 75 %. At a T-score of −2, OSIRIS has the highest PPV (71 %) at the expense of the lowest sensitivity (50 %). After classifying models based on AUCs and sensitivities, OSTA and ORAI are considered the most appropriate models, correctly classifying around 65 % of patients, with sensitivities of 70 % and PPVs of 64 % (Tables 5, 6).

Combining all of the above criteria, although none of these prescreening tools is considered “good” because of the low AUC, OSTA and ORAI (followed closely by SCORE) are more appropriate than the others. This statement should be interpreted with caution. The generalizability of these findings depends on the characteristics of the studied population; although a very large sample of Iranian postmenopausal women was included here, they might not be representative of all Iranian women. The database belongs to a tertiary center, an academic hospital in the capital of Iran. Also, we previously showed that, based on ORAI and SCORE, BMD orders for insured Iranians are appropriate only half of the time. Sixty percent of BMDs ordered based on OSTA and OSIRIS were inappropriate. This percentage increased to 67 and 90 % for NOF and ABONE, respectively, necessitating further analyses [33].

In conclusion, before these models can be accepted for use among Iranian postmenopausal women, they should first be evaluated in a national sample from different provinces and different BMD centers. If the results are still unsatisfactory, either newer models should be designed and tested or existing models should be recalibrated.