Introduction

Bone age is a radiographic assessment used in pediatric medicine because of its relative objectivity compared to chronological age [1,2,3]. For example, it is used in patients with scoliosis to determine whether surgical interventions are still viable [4]. It is also used in conjunction with other methods to estimate unknown chronological ages [5]. In orthopedics, bone age is essential for predicting outcomes in leg length discrepancy and for determining the appropriate time for an epiphysiodesis [6].

The most common way to determine bone age is through an X-ray of the left hand, although other, less popular models rely on different skeletal regions [6,7,8,9,10,11,12,13,14]. Greulich and Pyle (GP) is one of the most commonly used methods for analyzing hand radiographs [10, 15, 16]. It employs a collection of standardized images of the left wrist and hand at different developmental stages [4, 17]. A radiologist examines the patient’s X-ray, compares it to the radiographs in the atlas, and assigns the patient the bone age of the closest matching image [1, 4, 16, 18].

In recent years, new methods have been developed to increase the efficiency of bone age analysis. One such method, the shorthand bone age (SBA), was developed in 2013 at Boston Children’s Hospital. It is based on high-yield landmarks extracted from GP: the rater looks for specific features on the radiograph, such as epiphyseal fusion or sesamoid bone development (Fig. 1). The method is only valid for ages 12.5–16 years in boys and 10–14 years in girls [1]. Its developers report a significant reduction in analysis time while maintaining substantial agreement with GP [1].

Figure 1

Sample of the SBA process of determining bone age [1]. Analysis: starting at site 1, if the landmark is present, the examiner moves on to landmark 2, and so on. The last landmark whose criterion is fulfilled determines the assigned bone age of the patient. Heyworth, Benton E.; Osei, Daniel A.; Fabricant, Peter D.; Schneider, Robert; Doyle, Shevaun M.; Green, Daniel W.; Widmann, Roger F.; Lyman, Stephen; Burke, Stephen W.; Scher, David M. Journal of Pediatric Orthopaedics 33(5):569–574, July/August 2013. doi: https://doi.org/10.1097/BPO.0b013e318293e5f2. Image reproduced with permission from the Journal of Pediatric Orthopaedics.
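The figure describes a simple sequential decision rule: advance through the ordered landmarks while each criterion is met, and assign the bone age of the last fulfilled landmark. The following is a minimal illustrative sketch of that logic only; the landmark names and ages are placeholders, not the published SBA criteria from Heyworth et al. [1].

```python
# Illustrative sketch of the SBA-style sequential decision rule described in Fig. 1.
# The landmark list below is a placeholder; the actual SBA landmarks and their
# corresponding bone ages are defined in Heyworth et al. [1].

def shorthand_bone_age(landmark_present, landmarks):
    """Return the bone age of the last landmark whose criterion is fulfilled.

    landmark_present: function mapping a landmark name to True/False for a radiograph.
    landmarks: ordered list of (landmark_name, bone_age_years) pairs.
    """
    assigned_age = None
    for name, age in landmarks:
        if landmark_present(name):
            assigned_age = age      # keep advancing while criteria are met
        else:
            break                   # stop at the first unmet criterion
    return assigned_age

# Hypothetical usage with placeholder landmarks (not the published SBA criteria):
example_landmarks = [
    ("landmark_1", 10.0),
    ("landmark_2", 11.0),
    ("landmark_3", 12.0),
]
findings = {"landmark_1": True, "landmark_2": True, "landmark_3": False}
print(shorthand_bone_age(findings.get, example_landmarks))  # -> 11.0
```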

More recently, an automated algorithm developed by a group of radiologists in Toronto achieved first place in the 2017 Radiological Society of North America (RSNA) Machine Learning Challenge for predicting pediatric bone age. This method employs a deep learning algorithm trained on over 14,000 images graded using GP [19]. With this method, a physician can upload an X-ray and the algorithm will estimate the bone age, automating the entire process.

The aim of this study is to evaluate the accuracy and reliability of the two methods described above and to examine whether the reduction in analysis time compromises their efficacy. We hypothesize that the SBA method performed by inexperienced raters and the automated algorithm will produce values in agreement with GP performed by an experienced radiologist. In addition, we hypothesize high inter-rater and intra-rater reliability of the GP and SBA methods between two inexperienced raters over two trials.

Methods

Participants were selected from the BC Children’s Hospital radiology database in Vancouver, British Columbia, Canada. Ethics approval was obtained from the University of British Columbia research ethics board (H18-02756). Two hundred males and 200 females, chronologically aged 12.5–16 and 10–14 years respectively, were selected. These age restrictions were put in place to match the ranges allowed by the SBA model [1]. Exclusion criteria were a diagnosis or evidence of ipsilateral hand or wrist fracture within the past 2 years, skeletal dysplasia, and any diagnosis of an endocrine disease or other condition that may affect stature, such as growth hormone deficiency, congenital adrenal hyperplasia, elevated sex hormones, hypothyroidism, malnutrition, or chronic disease.

The gold standard was defined as the radiology report attached to each radiograph. These reports are written by experienced radiologists who routinely use GP in their practice. In our study, each participant had their bone age determined by two separate raters using both the GP and SBA methods. The raters were a medical student and a science undergraduate student with no radiological or clinical training and no prior experience in bone age assessment. The GP method was performed first because it is more complex and requires analysis of the entire radiograph, whereas SBA requires examining only 4–5 landmarks. For the same reason, the order of the methods was not randomized: performing SBA before GP could have allowed a rater to examine only a few landmarks rather than the whole radiograph when using GP. Patients whose bone age was outside the SBA age range were excluded from our statistical analysis, reducing the sample size to 150 males and 134 females. Additional radiographs were therefore recruited using the same inclusion and exclusion criteria until a final sample size of 213 males and 213 females was reached (Fig. 2). Two weeks after the GP readings, the SBA method was used to analyze all radiographs. Three weeks later, the two raters repeated the analysis of the radiographs using the two methods in the same order. The raters timed themselves using an online stopwatch while analyzing the radiographs on a computer screen. Finally, de-identified JPEG versions of the radiographs were securely uploaded to a server for automated analysis.

Fig. 2

Recruitment process

Statistical analysis

Shapiro–Wilk and Kolmogorov–Smirnov normality tests showed that the distributions of bone ages for all methods and the times for completion were not normally distributed (supplementary table). Intraclass correlation analysis between each method and the gold standard was performed. The median difference and interquartile range (IQR) between each method and the gold standard were obtained, in addition to a Wilcoxon signed-rank test to assess statistical significance. For intra-rater variability, intraclass correlation analysis between trial 1 (T1) and trial 2 (T2) was performed for each rater and method. For inter-rater variability, intraclass correlation analysis was performed between rater 1 (R1) and rater 2 (R2) for each method and trial. Finally, the median times for completion and their IQRs were obtained for each method.
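The following is a minimal, hypothetical sketch of these steps in Python; the placeholder data, the choice of the Shrout and Fleiss ICC(2,1) form, and the SciPy defaults are assumptions, since the study does not specify which ICC variant or software was used.

```python
# Hypothetical sketch of the statistical analysis; the data and the ICC(2,1)
# single-rater, absolute-agreement form are assumptions for illustration only.
import numpy as np
from scipy import stats

def icc_2_1(ratings):
    """Shrout & Fleiss ICC(2,1) for an (n subjects x k raters) array."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between-subject mean square
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between-rater mean square
    sse = np.sum((ratings - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                         # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Bone ages (years) from one method vs. the gold-standard report (placeholder data)
method = np.array([12.0, 13.5, 11.0, 14.0, 12.5])
gold = np.array([12.5, 13.0, 11.0, 14.0, 13.0])
diff = method - gold

print("Shapiro-Wilk p:", stats.shapiro(diff).pvalue)        # normality check
print("ICC(2,1):", icc_2_1(np.column_stack([method, gold])))
print("median diff:", np.median(diff),
      "IQR:", np.percentile(diff, 75) - np.percentile(diff, 25))
print("Wilcoxon p:", stats.wilcoxon(method, gold).pvalue)   # paired signed-rank test
```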

Results

Combined data

Correlation, median difference, and statistical significance

Intraclass correlations between each method and the gold standard are shown in Table 1. The automated algorithm showed the highest correlation (κ = 0.880), and R1 using the GP method in trial 1 showed the lowest correlation (κ = 0.825) with the gold standard. The table also shows the median difference and IQR between each method and the gold standard. Methods that show a statistically significant difference from the gold standard (P < 0.05) are in bold in Table 1.

Table 1 Intraclass correlation, median difference, and IQR between the various methods and raters and the gold standard

Time for completion

The median time for each method and the IQR are shown in Table 2. A box-and-whisker plot of the times for completion and their variability is shown in Fig. 2.

Table 2 The median times and IQRs for each method

Female data

Intraclass correlation, median difference, IQR, and statistical significance

Intraclass correlations, median differences, and IQRs for females are shown in Table 3. R2 using GP in trial 2 showed the highest correlation with the gold standard (κ = 0.703), and R1 using GP in T1 showed the lowest correlation (κ = 0.623).

Table 3 Intraclass correlation coefficient, median difference, and IQRs between gold standard and various analysis methods utilized in this study for females

Time for completion

The median time for each method and the IQR for females are shown in Table 4.

Table 4 Median time and IQR for each method in seconds for females

Male data

Intraclass correlation coefficients between the gold standard and the different methods used in this study are shown in Table 5. Median differences, IQRs, and statistical significance between each method and the gold standard are also shown in Table 5. Median times for completion and IQRs are shown in Table 6.

Table 5 Intraclass correlation coefficient, median difference, and IQRs between gold standard and various analysis methods utilized in this study for males
Table 6 Median time and IQR for each method in seconds for males

Intra-rater variability

Intraclass correlations between T1 and T2 for the same method and rater are shown in Table 7. There is excellent correlation between T1 and T2 for all methods, indicating high reliability of the two manual methods (SBA and GP).

Table 7 Intra-rater (first half of table) and inter-rater (second half of table) variability

Inter-rater variability

Correlations between the two raters for each method and trial are shown in Table 7. There is excellent correlation between the raters for each method and trial.

Discussion

The first objective of this study was to examine the accuracy of the SBA method compared to the gold standard. The intraclass correlation analysis showed high correlation, with κ = 0.855 for R1 and κ = 0.862 for R2 [20]. This is a stronger correlation than that reported by the developers of SBA [1]. On the other hand, the Wilcoxon signed-rank test showed a statistically significant difference between the SBA values and the gold standard, with a median difference of 0.5 years. Clinically, however, this difference may be insignificant. The GP and SBA methods provide bone ages in 0.5–1 year increments; for example, the options for females between the ages of 10 and 14 are 10, 11, 12, 13, 13.5, and 14, so a bone age that is 0.5 years higher or lower than the correct value may not be clinically significant. Furthermore, a bone age is considered clinically abnormal if it falls outside 2 standard deviations of the chronological age; these standard deviations are calculated and provided in the GP atlas [4, 17, 21]. For a 10-year-old female, 2 standard deviations constitute 21.6 months, which far exceeds the difference reported here between SBA and the gold standard. The same principle applies to male bone ages. Consequently, even though the SBA method performed by inexperienced researchers produced bone ages that are statistically different from those of an experienced radiologist using the GP method, these values still lie within the normal clinical ranges for bone age, and the method can therefore be considered efficacious.

In addition, the SBA method does not require the use of an atlas, making it usable in circumstances or areas where the GP method is not accessible [1]. On the other hand, SBA only addresses females between the ages of 10 and 14 and males between 12.5 and 16; thus, if a practitioner works with a wider range of ages, SBA may not be the most appropriate method.

The SBA was designed to be an efficient, easy-to-use method, and we found that to be true for the most part. However, the time required to use GP was not significantly longer: R1 required a median of 21.83 s for GP vs. 7.00 s for SBA, and R2 required 9.30 s for GP vs. 5.00 s for SBA. These time requirements decreased as the raters gained experience, leading to a median of 1 min or less per radiograph regardless of the method. It is important to highlight, however, that an average radiologist will not assess 200 consecutive bone age radiographs within a restricted age range; therefore, the numbers reported above likely underestimate the time it takes to determine a single bone age. This is supported by the work of Kim et al. (2017), in which one radiologist required 188 min and 22 s to assess 200 radiographs using GP (~0.94 min per X-ray) and a second radiologist required approximately 0.9 min per X-ray. In that study, 100 radiographs were assessed, followed by a 1-week washout period before the second 100 were assessed, whereas in our study the 200 radiographs for one gender were read in a single session, leading to improved pattern recognition by the rater [22]. This is likely the cause of the discrepancy in reading time between the two studies and of the overall shortened time required to assess bone age. In clinical practice, a radiologist may assess one or two bone age radiographs, perform other more frequently ordered tasks, and then return to a few more bone age scans; in this context, assessing a single radiograph is likely to take even longer.

Overall, SBA may be sufficient for physicians who specialize in populations within the allowed age range (e.g., adolescent medicine), but it is not a complete replacement for GP.
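As a quick check of the per-radiograph arithmetic and the clinical-threshold comparison cited above (all numbers are taken from the preceding paragraphs):

```python
# Per-radiograph reading time implied by Kim et al. (2017): 188 min 22 s for 200 radiographs.
total_seconds = 188 * 60 + 22            # 11,302 s
per_xray_min = total_seconds / 200 / 60  # ~0.94 min per radiograph
print(round(per_xray_min, 2))            # 0.94

# Clinical threshold comparison: 2 SD for a 10-year-old female (GP atlas) vs. the
# 0.5-year median difference between SBA and the gold standard observed here.
two_sd_months = 21.6
median_diff_months = 0.5 * 12            # 6 months
print(median_diff_months < two_sd_months)  # True: the difference lies well within 2 SD
```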

An automated algorithm was chosen in this study because of its potential in diagnostic radiology and in bone age assessment in particular. The algorithm provided by 16 bit relies on convolutional neural networks (CNNs) and uses both the pixel and sex information in the same network; for more technical details of the 16-bit algorithm, see Halabi et al. (2019) [19]. Multiple algorithms were available for this study, and we chose this one specifically because it achieved the best performance, measured by mean absolute error (MAE), in the 2017 RSNA Machine Learning Challenge for Pediatric Bone Age. The MAE achieved was 4.265 months from the gold standard, which was established by averaging the reads of three pediatric radiologists with the final report (for a total of four reads). The training and test data used in the RSNA challenge were the same as those used by Larson et al. (2018); their algorithm achieved an MAE of 6 months and was published prior to the challenge [23]. The training set consisted of 12,612 images from two US pediatric hospitals, and the test set consisted of 200 images.
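The actual 16-bit architecture is described in Halabi et al. (2019) [19]; the sketch below is purely illustrative of the general idea of feeding pixel data and sex into a single network, and the layer sizes, input resolution, and training settings are assumptions rather than that implementation.

```python
# Illustrative sketch only: one generic way to feed both pixel data and sex into a
# single CNN for bone age regression. Layer choices are assumptions, not the 16 bit model.
import tensorflow as tf
from tensorflow.keras import layers

img_in = layers.Input(shape=(256, 256, 1), name="radiograph")   # grayscale hand X-ray
sex_in = layers.Input(shape=(1,), name="sex")                    # 0 = female, 1 = male

x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.GlobalAveragePooling2D()(x)

combined = layers.Concatenate()([x, sex_in])                     # merge image features with sex
combined = layers.Dense(64, activation="relu")(combined)
bone_age = layers.Dense(1, name="bone_age_months")(combined)     # regression output

model = tf.keras.Model(inputs=[img_in, sex_in], outputs=bone_age)
model.compile(optimizer="adam", loss="mae")                      # MAE matches the challenge metric
```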

Work by Lee et al. (2017) tested a variety of CNN architectures using a data set of ~8325 radiographs for training and 15% of those for testing; patients aged 0–4 years were excluded. They reported female bone ages within 1 year of the reference 90.39% of the time and within 2 years 98.11% of the time, and male bone ages within 1 year 94.18% and within 2 years 99.00% of the time [24]. Similarly, Kim et al. (2017) used a training set of 18,940 images and a test set of 200 images; they achieved a root mean square error (RMSE) of 0.60 years, whereas 16 bit achieved an RMSE of 0.47 years [22]. Finally, the work by Mutasa et al. (2018) showed an MAE of 0.561 years, which is higher than that achieved by 16 bit [25].
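For reference, the MAE and RMSE metrics quoted in these comparisons are computed as follows (a minimal sketch with placeholder values):

```python
# MAE and RMSE as used to compare the algorithms above (placeholder values).
import numpy as np

predicted = np.array([130.0, 142.0, 155.0])   # predicted bone ages, months
reference = np.array([126.0, 148.0, 150.0])   # gold-standard bone ages, months

mae = np.mean(np.abs(predicted - reference))            # mean absolute error
rmse = np.sqrt(np.mean((predicted - reference) ** 2))   # root mean square error
print(mae, rmse)
```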

In this study, the automated algorithm showed good correlation (κ = 0.880) and a median difference of 0.33 years (~4 months) from the gold standard, which coincides with its results on the RSNA challenge test set [21]. The difference between this method and the gold standard was statistically significant. As with SBA, this difference is not clinically significant for the reasons explained above. Considering the good correlation between this method and the gold standard and the 4-month median difference, this algorithm can be considered efficacious, at least for the age range examined in this study. There are additional benefits to this approach that are not available with SBA or GP: it has no age restrictions and provides an opportunity to automate the entire process. A technician can simply upload the image to the server and obtain the bone age within seconds. It is unlikely that this method can eliminate the role of the radiologist completely; however, it can streamline the patient’s journey. Currently, a patient may see a physician who refers them for an X-ray; an image of the hand is taken and sent to the radiologist, who analyzes it and issues a report. With the use of AI, the physician could receive the bone age as soon as the X-ray is taken, which can speed up the diagnostic process and improve the patient’s quality of care. One additional benefit of the 16-bit algorithm compared to other AI algorithms is that it produces an output in months, allowing bone age values that fall between the discrete ages defined by the GP standards (e.g., 11.5 or 15.8 years). This increases the precision of the assessment and is especially useful when the bone age does not match a specific standard.

Upon examining the male and female data separately, the male values correlate with the gold standard similarly to the combined data; however, the female values show lower correlation coefficients. In females, the SBA method yielded κ = 0.650 for rater 1 and κ = 0.677 for rater 2, and method 3 (the automated algorithm) produced a correlation coefficient of 0.699. In addition, the median difference from the gold standard in females was 0.68 years (~8 months) for rater 1, 0.6 years (~7 months) for rater 2, and 0.65 years (~8 months) for method 3, all of which are higher than the values obtained from the combined and male data. One explanation is that there are fewer bone age options for females than for males within the age ranges allowed in this study: for males between the ages of 12.5 and 16, there are 7 options to choose from, whereas for females between the ages of 10 and 14, there are 6 options. In that sense, there is one fewer bone age a rater can use to analyze a radiograph. We also examined the correlation between method 3 and all the other methods used in this study (except the gold standard) for females, and it was within the range of 0.8–0.9. This could indicate that the gold standard itself has its own limitations and variability [1]. In recent years, there has been evidence of changes in puberty patterns in females over the last century [26, 27]. Future studies should explore the efficacy of GP in current populations and whether it needs to be adjusted.

Intra-rater variability and inter-rater variability

A secondary objective of this study was to examine the intra-rater variability of each method. The correlation between the two trials for each method and rater (e.g., R1M1T1 vs. R1M1T2) was excellent, in the range of 0.9–1. This confirms the reliability of these methods. Similarly, for inter-rater variability, we compared each method between the two raters and obtained excellent correlations in the range of 0.9–1. This highlights the limited variability between raters who may use these methods.

Limitations

This study relied on GP performed by experienced, licensed pediatric radiologists as the gold standard but did not account for variability between radiologists, although all were experienced radiologists working in an academic pediatric hospital. Due to time, resource, and logistical limitations, it was not possible to obtain a large enough sample of radiographs assessed by a single radiologist; as a result, we opted to treat the reports of all radiologists as the gold standard. In addition, there is evidence of increased variability in GP with ethnically diverse populations or in children with musculoskeletal conditions [1, 15, 28,29,30]. The efficacy of the automated algorithm was tested only within the restricted age range defined by SBA. Examining a wider range was the original goal of this study, as it would have allowed us to assess the full capability of the 16-bit algorithm; however, it would have required two comparison parameters, making the study overly complex. The logical next step for this research group is to examine the efficacy of the 16-bit algorithm across all ages compared to the gold standard. The raters timing themselves could also have introduced bias, which could have been minimized by using timing software; however, this was not possible due to resource limitations.

Only two methods were compared to GP in this study, yet multiple other models (manual and automated) have been developed in recent years and are worth examining [1, 31,32,33,34]. Tanner-Whitehouse III (TW) is another widely used method for assessing hand bone age. TW was originally included in this study; however, the original text describing the method is out of print and inaccessible, so it was removed [35]. Finally, this study included two raters and 213 males and 213 females. A larger sample size would help establish better evidence for these methods, and more raters would be useful for understanding the inter-rater variability within each method.

Conclusion

The shorthand bone age method and the automated algorithm produced values in agreement with the gold standard (Greulich and Pyle) while reducing analysis time and maintaining high inter-rater and intra-rater reliability.