Introduction

Osteoporosis is a multi-factorial disease, determined by interplay between genetic and non-genetic factors. Genetic effects are subtle, and to elucidate these effects one requires large numbers of subjects in whom high quality clinical data are matched with a widespread genomic approach to disease or trait pathogenesis. We have set up the Family and Osteoporosis (FAMOS) study in eight European centers to ensure that there are adequate numbers of families to give statistical power to detect subtle genetic effects against a noisy non-genetic background. The primary aim of this study was to conduct a genome-wide scan for linkage to BMD as assessed by axial dual energy X-ray absorptiometry (DXA), a key component in determining the risk of future osteoporotic fracture.

Since their introduction in the late 1980s, DXA bone densitometers have been used extensively in clinical practice and epidemiological studies [1, 2]. Their clinical value in the management of osteoporosis has been reviewed recently [3]. A variety of different versions of these machines is now in common use, each with different operating characteristics. Our study, like previous large epidemiological and pharmaceutical studies, involves multiple centers and a range of different DXA densitometers. For genetic studies such as FAMOS, in which samples that have been collected across all centers will be combined, it is imperative that data are consistent across centers, and, therefore, cross-calibration of DXA devices is essential.

In previous studies various standardization techniques have been used, and cross-calibration equations have been derived from measurements either from healthy volunteers or from a variety of semi-anthropomorphic phantoms [3]. The European spine phantom (ESP) was developed to provide a standard that does not depend on the software algorithms of any particular manufacturer [4, 5]. The ESP provides three test densities (0.5 g/cm², 1.0 g/cm² and 1.5 g/cm²) and a semi-anthropomorphic shape. These characteristics allow non-linear calibration curves to be fitted, if required, but challenge the image analysis software of the manufacturers, which is a major and often inaccessible component of DXA machines.

The initial approach to cross-calibration using phantoms was challenged, and this led to an enhancement that minimized differences between the three major manufacturers, using only healthy women and producing cross-calibration equations for the lumbar spine and total hip [6]. Such an approach can be, and is, used to produce a standardized BMD and, hence, a standardized “T-score” for diagnostic and epidemiological purposes (e.g., osteoporosis = T≤−2.5, i.e., 2.5 standard deviations or more below the young normal mean) by use of a recognized normative range such as the National Health and Nutrition Examination Survey (NHANES) III database [7] for the total hip DXA site. The approach works reasonably well for T-score comparisons for the lumbar spine and total hip as well as for the femoral neck, Ward's and trochanter regions [8].

In the FAMOS study we wished to recruit probands not on the basis of a single cut-off value of BMD that could be assessed by a T-score, but on the basis of low age-matched BMD, i.e., Z-scores. This required not only production of careful cross-calibration equations but also the use of a standard normative range. The cross-calibration exercise reported here was used to cross-calibrate 27 DXA scanners located at 17 hospitals under the aegis of the eight collaborating centers comprising the FAMOS study. The target was to collect extended pedigrees from probands who had BMD values at least 2 standard deviations below the age-matched mean (Z-score of −2.0 or less; approximately the lower 2.3% of the normal distribution) at the femoral neck and/or lumbar spine. Immediate relatives of these probands were also scanned, and if any of their BMD values were 1.28 standard deviations or more below the age-matched mean (Z-score of −1.28 or less; lower 10% of the normal distribution), their first-degree relatives were invited to participate, too. In that way it was possible to generate multi-generation pedigrees for further genetic analyses. Therefore, the age-matched BMD cut-offs applied at each center had to select a close-to-identical proportion of the population: those with a Z-score of −2.0 or less for probands and −1.28 or less for first-degree relatives. The approach we now report uses the ESP for the initial part of the cross-calibration, as described previously [9].
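The tail fractions implied by the two Z-score thresholds can be checked with a short calculation. The sketch below assumes that age-matched BMD follows a normal distribution; the function and variable names are illustrative and are not part of the study protocol.

```python
from statistics import NormalDist


def lower_tail_fraction(z_cutoff: float) -> float:
    """Fraction of a standard normal population at or below z_cutoff."""
    return NormalDist().cdf(z_cutoff)


# Thresholds quoted in the text: probands and their scanned relatives.
proband_fraction = lower_tail_fraction(-2.0)
relative_fraction = lower_tail_fraction(-1.28)

print(f"Z <= -2.00: {100 * proband_fraction:.1f}% of the population")
print(f"Z <= -1.28: {100 * relative_fraction:.1f}% of the population")
```

Under this assumption the two thresholds select roughly 2.3% and 10% of an age-matched population, respectively.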

Methods

Seventeen centers were involved in this study, and measurements were taken from 27 different DXA machines from three manufacturers (Norland, Cooper Surgical, Trumbull, Conn., USA; Lunar, GE Medical Systems, Madison, Wis., USA and Hologic, Bedford, Mass., USA), with nine different versions of each scanner from the three manufacturers. Machines and their locations are given in Table 1. At the start of the study a single ESP prototype was used to calibrate all machines. The ESP was positioned on each DXA machine and scanned in air ten times, without being repositioned, using the machine-specific lumbar spine scanning protocol and analysis.

Table 1 Hospitals and DXA scanners used in the FAMOS study (DK Denmark, NL The Netherlands)

An analysis of differences between various versions of the same manufacturer was then done by treating machines of the same version as independent replicates, allowing data collected for calibration to be used also to study the nature of variation between these different versions. To this end, a linear mixed model was fitted, using the residual maximum likelihood (REML) directive of GENSTAT [10]. Differences between different DXA versions, between the three ESP densities and the interaction of ESP density and DXA version were treated as fixed effects. Differences between machines of the same type and any interactions with this term were treated as random effects. Residual variation from this model comprises differences within ESP densities on each machine (i.e., between the ten replicate readings).

Since, at each scan, all three BMD densities were recorded together, estimates of variation within ESP density classes on the same machine may be correlated with each other. Such an effect could bias significance levels. However, estimates of the correlations between the three pairs of BMD densities, pooled over machines, were small and not statistically significant (0.054, −0.067, and 0.116). We therefore made the simplifying assumption that errors from replicate measurements on the same machine, but with different ESP density classes, are independent. We treated differences between similar versions of DXA scanner (up to 3% variation) as random effects, as we regarded these as replicates of the same type of machine, as might be observed in general clinical use.
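One way to obtain a correlation pooled over machines is to correlate replicate-level deviations from each machine's own mean. The sketch below illustrates that idea; it is an assumption about the pooling, not a description of the GENSTAT analysis actually used, and the readings are invented for illustration.

```python
import math


def pooled_within_machine_correlation(readings_a, readings_b):
    """Correlate replicate deviations from each machine's own mean.

    readings_a, readings_b: one list of replicate readings per machine,
    for two ESP densities recorded in the same scans (same ordering).
    """
    dev_a, dev_b = [], []
    for rep_a, rep_b in zip(readings_a, readings_b):
        mean_a = sum(rep_a) / len(rep_a)
        mean_b = sum(rep_b) / len(rep_b)
        # Removing the machine mean leaves only within-machine scatter.
        dev_a.extend(x - mean_a for x in rep_a)
        dev_b.extend(x - mean_b for x in rep_b)
    sxy = sum(a * b for a, b in zip(dev_a, dev_b))
    sxx = sum(a * a for a in dev_a)
    syy = sum(b * b for b in dev_b)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical replicate readings (g/cm^2) for two machines.
low_density = [[0.50, 0.51, 0.49, 0.50], [0.52, 0.53, 0.51, 0.52]]
mid_density = [[1.01, 0.99, 1.00, 1.02], [1.03, 1.04, 1.02, 1.01]]

r_low_mid = pooled_within_machine_correlation(low_density, mid_density)
print(f"pooled correlation: {r_low_mid:.3f}")
```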

Short-term stability of the machines used in this study was assessed by examining the standard deviation of the ten replicate readings on each machine. Calibration curves were fitted to all machines for the variable densities of the ESP, using the statistical package GENSTAT [10]. Initially, quadratic curves were fitted for each machine. Although these gave excellent fits to the data, extension of the curves outside the range of the ESP produced a maximum BMD value, for some machines, which was lower than some of the values seen in practice using those machines. As a consequence, very high BMD values observed on these machines (typically for L3 or L4) could not be converted to ESP-adjusted values, since the observed BMD value would not fall on any part of the calibration curve. For this reason, exponential curves were fitted to all machines. These gave as good a fit within the range of the phantom but behaved better at very high BMD values, with the curve rising to an asymptote outside the range of BMD values seen in practice. An example calibration curve is given in Fig. 1.

Fig. 1 Example of the quadratic and exponential calibration curves (Oxford Hologic QDR 1000). Each data point plotted represents ten observations at each of three densities

Calibration curves were not constrained to pass through zero, as such a constraint would not produce a more accurate calibration curve for the majority of values that fall within the range of the ESP and might well produce less accurate calibrations over this range.
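The difference between the two curve families can be illustrated with a small sketch. It assumes the exponential takes the asymptotic form measured = a + b·r^nominal, and it solves each curve exactly through three hypothetical mean readings rather than fitting all replicate observations as a statistical package would. The readings and all function names are invented for illustration.

```python
import math

NOMINAL = (0.5, 1.0, 1.5)        # ESP nominal densities, g/cm^2
MEASURED = (0.50, 0.95, 1.30)    # hypothetical machine means, g/cm^2


def fit_quadratic(xs, ys):
    """Exact y = c0 + c1*x + c2*x**2 through three equally spaced points."""
    h = xs[1] - xs[0]
    c2 = (ys[2] - 2 * ys[1] + ys[0]) / (2 * h * h)
    c1 = (ys[2] - ys[0]) / (2 * h) - 2 * c2 * xs[1]
    c0 = ys[1] - c1 * xs[1] - c2 * xs[1] ** 2
    return c0, c1, c2


def fit_exponential(xs, ys):
    """Exact y = a + b * r**x through three equally spaced points."""
    h = xs[1] - xs[0]
    step_ratio = (ys[2] - ys[1]) / (ys[1] - ys[0])  # equals r**h
    if not 0 < step_ratio < 1:
        raise ValueError("readings do not follow a saturating curve")
    r = step_ratio ** (1 / h)
    b = (ys[1] - ys[0]) / (r ** xs[1] - r ** xs[0])
    a = ys[0] - b * r ** xs[0]
    return a, b, r


def measured_to_esp(measured, a, b, r):
    """Invert the exponential curve; defined only below the asymptote a."""
    return math.log((measured - a) / b) / math.log(r)


c0, c1, c2 = fit_quadratic(NOMINAL, MEASURED)
quad_max = c0 - c1 ** 2 / (4 * c2)      # highest reading the quadratic can reach
a, b, r = fit_exponential(NOMINAL, MEASURED)

high_reading = 1.80                      # hypothetical very high lumbar BMD
print(f"quadratic maximum:     {quad_max:.3f} g/cm^2")
print(f"exponential asymptote: {a:.3f} g/cm^2")
print(f"ESP-adjusted value of {high_reading}: "
      f"{measured_to_esp(high_reading, a, b, r):.3f} g/cm^2")
```

With these invented readings the quadratic peaks below the high reading, so that reading has no ESP-adjusted value on the quadratic curve, whereas the exponential curve still returns one because its asymptote lies above it.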

We validated the effectiveness of the calibration curves, using ESP-calibrated and non-calibrated BMD values from hip and spine measurements on 991 female family members of the probands scanned as part of the FAMOS collaboration. The validation allowed us to partition variation into components between and within scanners. If calibration has been effective in removing variation between machines, then an analysis of calibrated data should attribute less variation to scanner than an analysis of raw, non-calibrated data. The effect of age on the calibrated and raw BMD values was examined, using simple linear regression.
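The comparison of raw and calibrated data can be sketched as a one-way analysis by scanner. The example below assumes that "percentage variance accounted for" is the adjusted R² expressed as a percentage, i.e., 100 × (1 − residual mean square / total mean square); the BMD values and scanner labels are invented for illustration.

```python
def pct_variance_accounted_for(groups):
    """Adjusted R^2 (as a percentage) from a one-way analysis by group.

    groups: dict mapping a scanner label to its list of BMD values.
    """
    values = [v for vals in groups.values() for v in vals]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    # Residual sum of squares: scatter around each scanner's own mean.
    resid_ss = sum(
        (v - sum(vals) / len(vals)) ** 2
        for vals in groups.values()
        for v in vals
    )
    total_ms = total_ss / (n - 1)
    resid_ms = resid_ss / (n - k)
    return 100 * (1 - resid_ms / total_ms)


# Hypothetical BMD values (g/cm^2): a large scanner offset before
# calibration and a much smaller one afterwards.
raw = {
    "scanner_A": [0.80, 0.90, 1.00, 1.10],
    "scanner_B": [1.10, 1.20, 1.30, 1.40],
}
calibrated = {
    "scanner_A": [0.85, 0.95, 1.05, 1.15],
    "scanner_B": [0.95, 1.05, 1.15, 1.25],
}

raw_pct = pct_variance_accounted_for(raw)
cal_pct = pct_variance_accounted_for(calibrated)
print(f"raw: {raw_pct:.1f}%  calibrated: {cal_pct:.1f}%")
```

A drop in the percentage attributed to scanner after calibration is the pattern that an effective calibration would be expected to produce.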

Results

Results from the model fitting exercise are given in Table 2. Differences between various versions from the same manufacturers, between densities, and their interaction are all extremely significant (P<10⁻⁶). However, there was no evidence of significant variation between machines of the same version (P=0.213). This implies that identical versions of machines from each manufacturer behave very similarly. However, variation that can be attributed to the interaction between ESP values and the machine type is large (P<10⁻⁶), approaching the replicate variance itself, hence confirming the need for separate calibration curves for each machine.

Table 2 Analysis of differences between machines (SE standard error, NS not significant)

Mean values of ESP for all 27 machines, grouped by version, are given in Table 3. It can be seen that Hologic machines tended to underestimate nominal BMD values across the range of densities, while Lunar machines tended to overestimate the nominal value. The two Norland scanners showed a mixed picture, overestimating the density at 0.5 g/cm² but underestimating the higher BMD values, substantially so in the case of the 1.5 g/cm² nominal vertebra. Standard errors averaged 0.0102 g/cm² across all machines, with the minimum error being 0.0042 g/cm² and the maximum 0.0134 g/cm².

Table 3 Average BMD at three ESP nominal values for 27 machines in the FAMOS study

The analysis of the standard deviation of readings, used as a measure of repeatability for each machine separately, is given in Table 4, and the mean effects for different versions of the same machine are shown in Table 5. There are large and significant differences between versions, between ESP BMD values and in the interaction of these two effects. The low numbers of machines for many versions make interpretation of the mean effects tenuous. However, on the basis of these data, Hologic machines are the most stable, followed by Lunar, then Norland (although data here are based on only two machines of different specifications). For Hologic and Lunar, repeatability decreases with increasing density, but, as there were only two Norland machines, no comment can be made on linearity with this manufacturer’s machines. The significance of the “machines within versions” component of variance demonstrates that there is at least as much variation in repeatability among machines of the same version as there is between machines of different versions and manufacturers. This contrasts with the results for ESP BMD, where variation between machines of the same version was much smaller (and not significant) than the variation between different versions of the same manufacturer’s machine and between different manufacturers.

Table 4 Analysis of repeatability across machines (SE standard error)
Table 5 Average repeatability across DXA versions and ESP nominal BMDs. Standard errors vary with number of machines tested for each model

Means and variances for BMD at each of three lumbar vertebrae, at the femoral neck, and at the trochanter, for the 991 female family members of probands from the FAMOS study, are given in Table 6. Mean values after calibration were numerically greater than mean values before, and this produced similarly numerically greater variances. The percentage variation accounted for by scanner and, for comparison, by age is presented in Table 7. It is clear that substantially more variation is accounted for by machine when raw BMD is analyzed than when calibrated BMD is analyzed. This is consistent with the calibration being effective in removing differences between different machines and different versions of the same machine. Percentage variation accounted for by regression on age is always highly statistically significant. The percentage variation accounted for by age is similar with both calibrated and raw BMD, and there is, therefore, no evidence that the calibration has introduced a new source of error variation into the measurement of BMD on these individuals.

Table 6 Mean and variance of BMD at each of three scan sites on female non-probands from the FAMOS study
Table 7 Percentage of variance in BMD accounted for by regression on scanner type (manufacturer and variety) and on age

The results of the cross-calibration were used to derive center- and machine-specific cut-offs based on Z-scores of −1.28 and −2.0. In summary, the cut-off, post-calibration BMD was calculated from the Hologic male and female normative ranges, as this was the most commonly used manufacturer of densitometer in our collaboration. Femoral neck and lumbar spine (L2–4) post-calibration BMD was calculated from the derived machine-specific exponential equations. For each age, a cut-off standardized BMD was used to calculate what the cut-off for measured BMD would be for both male and female subjects and at both chosen regions of interest. The Z scores for the cut-off values were then calculated by the equation:

$$ \text{Z score} = \frac{\text{Standardized BMD} - \text{Measured BMD}}{1\,\text{SD}\;\text{Standardized BMD}} \tag{1} $$

The individual center- and machine-specific Z-scores were used to recruit subjects for the FAMOS genetic study, as indicated in the introduction.
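The derivation of a machine-specific cut-off can be sketched in two steps: find the standardized BMD that lies a given number of SDs from the age-matched normative mean, then pass it through that machine's calibration curve to express it in the units the machine reports. The sketch uses the convention of the introduction, in which low BMD gives a negative Z-score. The normative mean, the SD and the curve parameters are hypothetical placeholders, not values from the Hologic normative range or from any FAMOS scanner, and the exponential form a + b·r^x is an assumption.

```python
def standardized_cutoff(norm_mean, norm_sd, z):
    """Standardized BMD lying z SDs from the age-matched normative mean."""
    return norm_mean + z * norm_sd


def esp_to_measured(standardized, a, b, r):
    """Forward calibration curve: ESP-scale BMD -> BMD read on this machine."""
    return a + b * r ** standardized


# Hypothetical age-matched normative values for one age and sex (g/cm^2).
NORM_MEAN, NORM_SD = 0.85, 0.11
# Hypothetical exponential calibration parameters for one machine.
A, B, R = 2.525, -2.6036, 0.6049

cutoffs = {
    z: esp_to_measured(standardized_cutoff(NORM_MEAN, NORM_SD, z), A, B, R)
    for z in (-2.0, -1.28)
}
for z, bmd in cutoffs.items():
    print(f"Z = {z:+.2f}: recruit if measured BMD <= {bmd:.3f} g/cm^2")
```

Repeating the calculation for each age, sex, region of interest and machine would yield a lookup table of cut-offs in each scanner's own units.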

Discussion

Previous efforts to cross-calibrate DXA BMD have been based on measurements of phantoms [9], human volunteers [6] or a mixed approach [3]. These approaches produce reasonable agreement for absolute BMD, hence allowing clinical measurements on one scanner to be compared with those from another manufacturer. However, using standardized BMD for multi-site epidemiological studies is more problematic, as there is a need to relate the findings to the expected age-associated bone loss, which may be center-specific. To enable such comparisons, a European normative range for both lumbar spine [11] and femoral neck [12] was developed using the ESP. The same approach was used to determine real geographic differences between BMD at centers across Europe taking part in the European Vertebral Osteoporosis Study (EVOS) [13] and at different centers in subjects with osteoporotic fractures [14].

Our primary aim was to conduct a genome-wide scan for linkage to BMD as assessed by axial DXA. To advance this program, we needed not only a method of standardizing BMD across eight different sites in Europe but also a way to relate the results to a normative range, thus enabling selection of approximately equal proportions of female probands at each site with equivalent age-specific standardized BMD. Accordingly, we first calibrated 27 DXA machines using an ESP and then validated our calibration equations in 991 female subjects recruited as part of the FAMOS study. The considerable reduction in variation between machines after calibration, coupled with the absence of a similar reduction in variation that can be attributed to a linear regression on age, demonstrates not only the effectiveness of the calibrations but, moreover, the importance of carrying out this exercise. There was, however, variation remaining between machines after calibration, and, while this could reflect inaccuracies in the calibration exercise, it is more likely that these residual effects were a result of real differences among local populations in our sample, as has been shown previously [13, 14].

Having derived the cross-calibrations, we then calculated center-specific cut-offs for age-corrected BMD (Z-scores) from a normative range of standardized BMD. We originally planned to use the previously ESP-derived European normative range, which gives standardized BMD for the lumbar spine [11] and femoral neck [12], our chosen regions of interest for proband selection. Although the site recognized as the standard for diagnosis of osteoporosis is total hip BMD [15], this site was not available in this study as we chose to use historical BMD databases from which to select our probands. Our decision not to use the European normative range was based on the fact that the NHANES III normative range is now applied as standard on all the BMD devices used in this study, although this normative range is available for only the hip sites [7]. We decided to standardize the data using the “Hologic” normative range, as the majority of scanners available in our centers were from this manufacturer. The new Hologic normative range is NHANES-compatible for the hip [16]. Unlike the old Hologic normative range for hip BMD [17], the normative range for spine BMD is consistent across the two major manufacturers.

Despite using the most prevalent normative ranges available in our study centers (NHANES for the hip and the manufacturer-specific range for the spine), we still applied our cross-calibration algorithm to each machine because, after modeling, we still detected significant differences between different versions from the same manufacturers in the ESP cross-calibration, as shown in Table 2.

Given the relatively small number of versions of machines from some manufacturers in this data set, it is difficult to draw firm conclusions about the merits of the different versions or manufacturers. However, there are differences between machine versions in repeatability, and there are also significant differences in calibration equations between machines of the same version. This gives support to the cautious approach taken in this study of producing a separate calibration curve for each machine. However, not all scanners used in the FAMOS study collaboration were calibrated. In addition to the 27 scanners calibrated here, some scanners have been used for small numbers of individuals who were typically relatives of probands who lived a considerable distance away from one of the centers in this study. For these individuals, it was impracticable to calibrate all the scanners involved; this is of lesser importance to our data collection, as the vast majority of them were family members rather than probands. Fortunately, it is clear from the clustering of values given in Tables 3 and 4 that using an average calibration curve for the version or, if the specific version is unknown, for the manufacturer, while not ideal, is better than excluding the data.

The problem of observed BMD values producing calibrated values that fall outside the range of the ESP merits further comment. There were 4,847 calibrated BMD values available for use in our validation study from the five measured scan sites (L2, 3, 4, femoral neck, and trochanter) in the cohort of 991 women. Of these 4,847 values, 1.55% fell below the 0.5 g/cm² lower ESP value, and 0.99% were above the upper 1.5 g/cm² ESP value. Hence, 2.54% of the observations fell outside the range of accurate calibration. In retrospect, a phantom with more extreme values than the ESP used here would have been preferable. For the FAMOS study we have assumed that the projection of the calibration curves outside their range does not introduce additional sources of error. In this respect, fitting an exponential curve has the effect of producing a calibrated value for all machine values observed to date. This implies more realistic behavior for this method than that produced by using quadratic calibration. Moreover, the results from the validation exercise presented in Table 7 indicate that the calibration to an ESP has, indeed, worked in practice.

In conclusion, this cross-calibration exercise has been successful in allowing machine- and version-specific calibration curves to be produced for each instrument at each of the centers involved in the FAMOS study. The validation of this exercise demonstrates that, for multi-center studies where absolute BMD cut-offs are to be produced, this method of cross-calibration will be more successful than the previous methodology of producing standardized BMD values that do not allow adequately for machine and version variance. This method is thus effective for selecting comparative age-matched BMD cohorts in different geographical populations.