Introduction

Congenital adrenal hyperplasia (CAH) is the generic term for a group of autosomal recessive metabolic diseases involving a deficiency of any of five enzymes responsible for the synthesis of cortisol in the adrenal gland. The most frequent form of CAH, accounting for more than 90 % of all cases, is attributable to a deficiency of the 21-hydroxylase enzyme [1]. Life-endangering “salt-wasting syndrome” can ensue if both cortisol and aldosterone synthesis are affected. The variant that affects only cortisol synthesis is referred to as “simple virilizing CAH.” It manifests as virilization of variable degree in girls and precocious puberty in boys. Milder, non-classic forms of CAH often remain undetected in males and lead to hirsutism and impaired fertility in females [2]. The worldwide incidence of CAH has been estimated at 1 in 16,000 births [3].

If left untreated, the overproduction of androgens in CAH also leads to accelerated gain in height. However, this effect is overlaid with accelerated maturation of the skeleton, resulting in short adult stature in most children [4–8]. Therefore, besides avoiding an Addisonian crisis, attainment of normal adult stature is one of the primary goals in the treatment of CAH. Treatment primarily consists of providing the patient with enough glucocorticoids and mineralocorticoids to suppress ACTH-mediated excess androgen production. Finding the correct dosage poses a challenge because even slight overdoses of glucocorticoids will curb growth over time [6, 8–11]. A meta-analysis of 18 studies carried out from 1977 to 1997 found a mean adult height following glucocorticoid treatment of 1.37 SD below the respective population mean [10]. Experimental therapeutic approaches include addition of growth hormone [12] or peripheral inhibition of androgen activity and estrogen production.

Treatment of children with CAH involves regular measurement of 17-hydroxyprogesterone as well as bone age and height determination [1]. Some authors consider bone age to be the most useful follow-up parameter after body height, partly due to the fact that 17-hydroxyprogesterone responds quickly to medication, thus giving no indication of a patient’s long-term compliance.

Rating bone age in CAH can require greater than average expertise, especially when bone age is extremely advanced. With growing demands for higher productivity in radiology, radiologic expertise is becoming an increasingly precious resource. The advance of computerized bone age rating methods is therefore welcomed by many as a potential means of relieving radiologists of the analysis of large numbers of unremarkable radiographs. The CE (European Conformity)-marked medical device used for this purpose is BoneXpert (Visiana, Holte, Denmark), a software package that can be used on one image at a time in daily routine or, alternatively, to analyze large quantities of hand radiographs in an unattended batch job. Its rating reliability in healthy populations has been demonstrated [13–16]. Furthermore, the automated method has been successfully used for children with short stature of various diagnoses [17] or with central precocious puberty [18]. However, it has not to our knowledge been validated in children exposed to androgens since early gestation, such as children with CAH. Further, our cohort of children with CAH includes some children with extremely advanced bone age because they had emigrated from countries where screening for CAH was not institutionalized at the time of their birth. These children are particularly challenging in terms of management, starting with their bone age reading. We were interested to test whether the automated method could cope with radiographs from such children.

The purpose of this study was to determine how automated bone age performs on radiographs of children with CAH compared with human bone age raters.

Materials and methods

Eight hundred and ninety-two left-hand radiographs from 100 children and adolescents aged 0 to 17 years with a diagnosis of CAH who had been treated at our clinic during the period from January 1, 1975, to December 31, 2006, were included in the study. If the films were not already available in digital form as DICOM images, they were scanned with a Vidar Diagnostic Pro Advantage scanner (Vidar, Herndon, VA). The study was approved by the Tübingen University Hospital Ethics Committee.

Automatic bone age rating was performed using BoneXpert version 1.0 (Visiana, Holte, Denmark, www.BoneXpert.com). BoneXpert calculates bone age based on the shape and appearance of 13 bones of the hand: the phalanges and metacarpals of the first, third and fifth ray and the radius and ulna. The program rejects a bone if its shape is abnormal or if its bone age deviates by more than 2.4 years from the mean of all bones. Furthermore, if fewer than 8 bones are accepted, the entire radiograph is rejected for bone age analysis. The intended Greulich-Pyle bone age range for the automatic rating in its current version is 2 to 15 years for girls, and 2.5 to 17 years for boys. A more detailed description of the calculation methods employed by BoneXpert has been published elsewhere [17, 19].
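The two numeric rejection rules described above can be illustrated with a minimal sketch. This is not BoneXpert's implementation: the function name and input format are hypothetical, shape-based rejection is assumed to have happened beforehand, and it is an assumption that the reference mean is taken over all supplied bones and that the final bone age is the mean of the accepted bones.

```python
from statistics import mean

MAX_DEVIATION_YEARS = 2.4  # per-bone tolerance stated in the text
MIN_ACCEPTED_BONES = 8     # minimum number of accepted bones stated in the text


def rate_radiograph(bone_ages):
    """Return a bone age in years, or None if the radiograph is rejected.

    bone_ages: hypothetical list of per-bone bone age estimates (years)
    for the bones that passed the shape check.
    """
    overall = mean(bone_ages)
    # Reject individual bones that deviate too far from the mean of all bones.
    accepted = [b for b in bone_ages if abs(b - overall) <= MAX_DEVIATION_YEARS]
    # Reject the whole radiograph if too few bones remain.
    if len(accepted) < MIN_ACCEPTED_BONES:
        return None
    # Assumption: the reported bone age is the mean of the accepted bones.
    return mean(accepted)
```

With 13 bones of which one is a clear outlier, the outlier is dropped and the remaining 12 bones determine the result; with only 7 usable bones the function returns None.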

In 35 examinations, the image was available both as a DICOM file and as a printout on film of the DICOM image. A comparison of the ratings performed by automated bone age on the DICOM files and on the scanned film-printouts yielded differences no greater than 0.5 years, of which 18 were greater than 0.2 years. The mean difference was 0.0 years, i.e. there was no trend for the scanned films to yield older or younger bone age values than the DICOM images. The DICOM file was used whenever one was available.

Validity

Since an objective measure of bone age does not exist, we must resort to various indirect ways of assessing the validity (accuracy) of a new rating method. Thodberg et al. [15] circumvented the problem of the lack of an objective bone age by judging the accuracy of bone age rating on the basis of its ability to predict an objective parameter to which it is related, namely final height. However, this approach is not recommendable in children whose final height may be affected by subsequent treatment. In this paper, we look for deviations from the results obtained with the manual method in terms of bias and standard deviation. This is an example of analysis of agreement between two measurements, so it was natural to use Bland-Altman plots and – in accordance with this concept – to use the standard deviation rather than the correlation. It was tested whether the bias is significantly different from zero (a qualitative result), while the standard deviation was reported as a quantitative endpoint.

Notice that the standard deviation of the differences between the two methods is usually larger than the standard deviation from the line of fit because it ignores the bone age-related slope between the two methods in a Bland-Altman plot.
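The agreement statistics discussed here (bias, SD of the signed differences, slope of the line of fit, and SD around that line) can be sketched as follows. The function and variable names are illustrative, and an ordinary least-squares fit of the differences on the pairwise means is assumed; the text does not specify how the line of fit was obtained.

```python
from statistics import mean, stdev


def bland_altman(bxba, manba):
    """Return (bias, sd_diff, slope, sd_about_fit) for paired ratings in years."""
    diffs = [b - m for b, m in zip(bxba, manba)]        # BXBA - ManBA
    means = [(b + m) / 2 for b, m in zip(bxba, manba)]  # x-axis of the plot
    bias = mean(diffs)
    sd_diff = stdev(diffs)
    # Ordinary least-squares slope of the differences on the pairwise means
    # (an assumed choice of fit).
    mx = mean(means)
    sxx = sum((x - mx) ** 2 for x in means)
    slope = sum((x - mx) * (d - bias) for x, d in zip(means, diffs)) / sxx
    # Residual SD around the fitted line (n - 2 degrees of freedom).
    resid = [d - bias - slope * (x - mx) for x, d in zip(means, diffs)]
    sd_about_fit = (sum(r * r for r in resid) / (len(diffs) - 2)) ** 0.5
    return bias, sd_diff, slope, sd_about_fit


# Made-up illustrative ratings, not study data.
bias, sd_diff, slope, sd_about_fit = bland_altman(
    [5.0, 7.0, 9.5, 12.0], [5.5, 7.0, 9.0, 12.5]
)
```

When the slope is close to zero, as reported in the Results, the two standard deviations are nearly identical.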

An additional, quantitative test was included as follows: Images for which BoneXpert bone age (BXBA) and manual bone age (ManBA) differed by more than 1.5 years were rerated by four experienced raters. Rerating was done without knowledge of ManBA, BXBA and chronological age. The mean of the four bone age values, referred to as the ReferenceBA, was compared anew with BXBA and ManBA. The same approach has been used previously [17, 18]. ReferenceBA acts as a secondary and very reliable outcome. The performance of BXBA and ManBA in terms of their deviation from ReferenceBA was calculated using the 1-sample proportions test with continuity correction, taking into account that only observations from different subjects were statistically independent.

Precision

By contrast, there is no need for an objective measure of bone age to determine the precision of a new bone age rating method, i.e. its ability to generate reproducible results. One way to determine this is to study the smoothness of longitudinal curves obtained. In the present study, the precision of automatic and manual rating was assessed in terms of the smoothness of longitudinal curves using the triplet method [19]. This involves breaking down individual longitudinal bone age series with n bone age measurements into n – 2 triplets of consecutive bone age measurements and considering the residual between the middle bone age and the linear interpolation between the two measurements on either side, for each triplet. The triplet method assumes that the three bone age measurements lie on a straight line as a function of age if there is no precision error. Hence, any deviation from the line is interpreted as due to precision error of the bone age method. Since in reality many more factors lead to a deviation from a straight line, the triplet method yields an upper limit of the true precision. To improve the estimate, we only considered triplets spanning less than 1.7 years. This left us with 327 triplets out of the original 604 (54 %). The result of the precision analysis is both quantitative and qualitative. The quantitative result was the estimated value of the precision of the automated and manual methods given with confidence intervals. In addition, we compared the precision of manual and automated methods and tested whether they were significantly different, i.e. a qualitative result (see footnote 1).
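The triplet residual described above can be sketched in a few lines. The function name is hypothetical, and the step that converts residuals into a single-measurement precision is an assumption rather than the published procedure [19]: it supposes independent errors of equal variance on the three ratings, so that a residual with interpolation weights w1 and w3 has variance sigma² · (1 + w1² + w3²).

```python
from math import sqrt


def triplet_precision(series, max_span=1.7):
    """Estimate rating precision (years) from longitudinal series.

    series: list of (ages, bone_ages) pairs, one per child, each ordered by
    chronological age. Returns None if no triplet qualifies.
    """
    terms = []
    for ages, bone_ages in series:
        for i in range(len(ages) - 2):
            t1, t2, t3 = ages[i:i + 3]
            b1, b2, b3 = bone_ages[i:i + 3]
            # Keep only strictly ordered triplets spanning less than max_span.
            if not (t1 < t2 < t3) or (t3 - t1) >= max_span:
                continue
            w3 = (t2 - t1) / (t3 - t1)  # weight of the later neighbour
            w1 = 1.0 - w3               # weight of the earlier neighbour
            residual = b2 - (w1 * b1 + w3 * b3)
            # Assumed scaling: Var(residual) = sigma^2 * (1 + w1^2 + w3^2).
            terms.append(residual ** 2 / (1.0 + w1 ** 2 + w3 ** 2))
    if not terms:
        return None
    return sqrt(sum(terms) / len(terms))
```

A perfectly linear series yields a precision of zero, whereas a middle rating that lies 0.3 years off the interpolated value of an evenly spaced triplet yields about 0.24 years.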

To further illustrate the behaviour and precision of automated bone age compared with manual bone age rating in children with extremely deviant bone age, we plotted the longitudinal course of three children whose skeletal maturity was particularly advanced. For this we selected, from the subset of children of whom at least 12 images were available, the three with the most advanced bone age (mean of ManBA and BXBA minus CA) at their first visit.

Statistical calculations were performed using the JMP 9 software package (SAS Institute Inc., Cary, NC) and version 2.12.1 of the R statistics software (www.r-project.org).

Results

Analysis of rejected images

One hundred sixteen of the 892 images were rejected by BoneXpert. In 111 of these, the rejection was due to bone age being below BoneXpert’s specified rating range, i.e. ManBA below 2.0 years in girls or below 2.5 years in boys. Of the remaining five images, three were rejected because of poor image quality and one because of improper scanning; one rejection remains unexplained and thus represents the inefficiency of the automated method.

Comparison between ManBA and BXBA

For the 776 images (480 from girls, 296 from boys; Table 1) analyzed by automated bone age, the mean difference BXBA – ManBA was −0.02 years (N.S.), the slope of the line of fit in a Bland-Altman plot was negligible at −0.02 years/year and the standard deviation (SD) of the signed differences was 0.72 years. The mean of the absolute (unsigned) differences was 0.54 years (SD 0.40 years). In 20 images, the absolute difference between BXBA and ManBA was greater than 1.5 years (Fig. 1).

Table 1 Characteristics at the time of the first radiograph (children whose radiographs were accepted by BoneXpert): mean chronological age and manual bone age (ManBA) by gender

Fig. 1 Bland-Altman plot of the relation between automatic (BXBA) and the original manual rating (ManBA) in boys (circles) and girls (dots). The difference between the two methods is shown against the mean of the two methods. The dotted lines indicate the range of discrepancy <1.5 years and the grey line indicates the regression. The 20 radiographs that differed by >1.5 years are analyzed in Table 2

These images were submitted for blind rerating by four raters (two radiologists and two pediatric endocrinologists, all having between 10 and 30 years of practice in bone age rating). None of the rerated images showed an absolute difference between ReferenceBA and BXBA greater than 1.5 years. ReferenceBA was closer to ManBA in 2 images and closer to BXBA in 18 (Table 2). To estimate the statistical significance of this observed advantage of BXBA, we note that the 20 rerated images are from ten children, so we have ten rather than 20 independent observations, and the two cases where ManBA is better than BXBA occur in children with three visits, so for these two children, ManBA was better than BXBA in one-third of the visits. Thus, we have observed BXBA to be better than ManBA in 9.33/10 independent cases (93 %). We now take as null hypothesis that ManBA and BXBA are equally close to the ReferenceBA and a proportion test then shows that BXBA is closer to ReferenceBA than is ManBA with P=0.02. (The test for the proportion 1 in 10 gives P=0.027 and for 0 in 10 gives P=0.004, and the quoted p-value is the interpolation to the proportion 0.67 in 10.) Notice also that in this computation of statistical significance it is irrelevant how many images and subjects there were in the total study, prior to selecting the disputed cases.
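The quoted p-values can be reproduced with a short sketch of the one-sample proportion test with continuity correction, using a chi-square statistic with one degree of freedom. The function name is illustrative, and linear interpolation between the results for 0 and 1 "ManBA better" cases is assumed, as the parenthetical remark suggests.

```python
from math import erfc, sqrt


def prop_test_p(successes, n, p0=0.5):
    """Two-sided one-sample proportion test with continuity correction."""
    expected = n * p0
    # Yates correction: shrink the observed deviation by 0.5 (not below zero).
    corrected = max(abs(successes - expected) - 0.5, 0.0)
    chi2 = corrected ** 2 / (n * p0 * (1.0 - p0))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2.0))


p_one = prop_test_p(1, 10)   # ManBA closer in 1 of 10 independent cases
p_zero = prop_test_p(0, 10)  # ManBA closer in 0 of 10 independent cases

# Two children each contribute one-third of a case: 2/3 of a case in total.
fraction = 2.0 / 3.0
p_interp = p_zero + fraction * (p_one - p_zero)
```

Under these assumptions, `p_one` and `p_zero` come out at about 0.027 and 0.004, and the interpolated value rounds to 0.02.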

Table 2 Bone age rating results for the 20 images with a difference between the original manual rating (ManBA) and automatic rating (BXBA) greater than 1.5 years. These were subsequently rerated (ReferenceBA)

Our analysis of bone age rating precision based on the smoothness of longitudinal curves comprised a total of 327 data triplets spanning less than 1.7 years for both ManBA and BXBA. The following precision results were obtained: ManBA: 0.32 years (95 % CI: 0.29–0.35); BXBA: 0.21 years (95 % CI: 0.19–0.23). This indicates a significant difference in precision (P<0.001).

Figure 2 shows the longitudinal BXBA and ManBA curves of three children who had been selected for their extreme skeletal prematurity as described above (mean bone age advancement: 3.1 ± 3.0 years). None of the 20 images with a bone age discrepancy greater than 1.5 years between ManBA and BXBA was from any of these 3 children. The SD of the signed differences between BXBA and ManBA for these three curves was 0.63 years. A comparison of the smoothness of the longitudinal curves generated by ManBA and BXBA for these 3 children (33 triplets spanning less than 1.7 years) yielded precision values of 0.24 years (95 % CI: 0.17–0.32) for ManBA and 0.16 years (95 % CI: 0.12–0.23) for BXBA (N.S.).

Fig. 2 Longitudinal charts of BoneXpert bone age (black) and manual bone age (grey) in 3 children selected according to the following criteria: At least 12 images available per child; of these, the 3 children with the most advanced bone age at diagnosis

The difference between automatic and the original manual bone age rating as a function of bone age advancement (Fig. 3) shows a slope of 0.1 years/year (x-intercept = +0.7 years). This means that BoneXpert tends to produce slightly lower ratings than the manual rating with increasing bone age advancement.

Fig. 3 Difference between automatic (BXBA) and the original manual (ManBA) bone age rating in boys (circles) and girls (dots) is shown as a function of bone age advancement in CAH

Discussion

Rejected images – BoneXpert’s behaviour in the low bone age range

BoneXpert’s inability to rate images below bone age 2 in girls and 2.5 in boys was a greater limitation in this study than it had been in other clinical studies on BoneXpert [17, 18]. In contrast to those studies, the majority of children in the present study were regularly monitored for bone age from birth, which is when CAH is usually diagnosed. Thus, in 111 of the 116 images rejected this was due to low bone age, and the overall rejection rate was greater than 10 %. In a study on short stature and in another on central precocious puberty, the rejection rate was less than 1.5 % (14/1,097 and 9/732, respectively) [17, 18]. Extending BoneXpert’s application range further towards birth may be a worthwhile project in view of the present findings.

Interobserver error between automated and manual ratings

At 0.72 years, the standard deviation of the signed differences between the ratings of BoneXpert and ManBA was in the same range as in analogous studies on other pathologies [17, 18] where standard deviations ranged from 0.71 to 0.8 years. The mean of the absolute differences (0.54 years) compares favourably with the levels of interobserver error between human raters reported in the literature. Berst et al. [20] give 0.69 ± 0.48 years for the mean of the absolute differences between two trained observers performing bone age ratings on 107 radiographs in awareness of CA. Expressed in terms of the SD of the signed differences and assuming a normal distribution, this equates to 0.69/0.8 = 0.86 years. King et al. [21] reported bone age readings of 50 radiographs performed by each of three raters where the SD of the signed differences was 0.80 years [20].
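The conversion from a mean absolute difference to an SD of signed differences relies on a property of the zero-mean normal distribution: the expected absolute value equals the SD times sqrt(2/π), roughly 0.8. A brief sketch follows; the names are illustrative, and normality with zero bias is assumed, as in the text.

```python
from math import pi, sqrt

# For zero-mean normal X with standard deviation sigma: E|X| = sigma * sqrt(2 / pi).
MEAN_ABS_TO_SD = sqrt(2.0 / pi)  # about 0.798, rounded to 0.8 in the text


def sd_from_mean_abs_diff(mean_abs_diff):
    """Convert a mean absolute difference into the SD of signed differences."""
    return mean_abs_diff / MEAN_ABS_TO_SD


sd_berst = sd_from_mean_abs_diff(0.69)       # Berst et al. [20]
sd_this_study = sd_from_mean_abs_diff(0.54)  # mean absolute difference in this study
```

Applied to the 0.54 years found in this study, the same conversion gives roughly 0.68 years, close to the directly measured SD of 0.72 years.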

Rerating results: implications for automated and manual rating

In 20 images, the discrepancy between manual and automatic rating was greater than 1.5 years. These were each blindly rerated by four independent raters. The mean of these four ratings (ReferenceBA) deviated by less than 1.5 years from BXBA for all images.

We can conclude from our results that the original discrepancies in the 20 rerated images were due more to manual errors than to errors in automatic rating. This is indicated by 18 reratings being closest to the automatic rating and 2 closest to the original manual rating, as well as by the smaller mean absolute difference between BXBA and ReferenceBA as compared to that between ManBA and ReferenceBA. If ReferenceBA is assumed to represent the “true bone age,” then BXBA was significantly more accurate than ManBA (P=0.02) in rating these 20 images. This is supplemented by the outcome of our comparison of the smoothness of longitudinal curves, which clearly showed the automatic rating method to be more precise, despite the fact that the automated method rated the images independently whereas the original manual raters usually had the previous rating and the age of the child available. This tends to enhance smoothness and improve the precision result derived for manual rating.

Performance of BoneXpert in children with extremely advanced bone age

Our comparison of ratings performed on children with far advanced bone age suggests that this is not a specific source of discrepancy between manual and automatic rating in CAH, since the SD of the signed differences between ManBA and BXBA for this subgroup was even smaller, albeit nonsignificantly, than it was for all children taken together.

Conclusion

BoneXpert supplies satisfactory bone age ratings in children with CAH within its designated bone age application range (bone age: 2–15 years for girls, 2.5–17 years for boys). The high rate of image rejection found in younger children underscores the need to extend the programme to infants.