Introduction

Bone age (BA) assessment is the interpretation of skeletal maturity from a radiograph of the left hand. The estimated BA value serves as a reference for physicians in pediatric health care and in other contexts, e.g., forensic analysis and sports medicine [1]. Radiologists usually produce a BA report based on the Greulich-Pyle (G&P) atlas, one of the most widely used methods for BA assessment [2]. In clinical practice, BA assessment includes both the BA value and the clinical determination of BA. The patient's BA value is assigned by best matching the left hand and wrist radiograph with a reference standard image from the G&P atlas. The clinical determinations based on the BA value are advanced, normal, and delayed skeletal development; the Brush data (standard deviations of BA tabulated in the G&P atlas) are used to define the skeletal development condition. Some studies have suggested that AI has a potential advantage over humans in BA assessment because BA is a quantitative value and is therefore an ideal target for automated image evaluation [3, 4].

Deep learning, a subtype of machine learning, has shown high accuracy in various medical image analysis tasks [5]. In recent years, many novel deep learning approaches have been applied to BA assessment [6]. The Radiological Society of North America (RSNA) Pediatric Bone Age Machine Learning Challenge [7] was launched at the 2017 RSNA Annual Meeting; the 10 best teams achieved low mean absolute differences (MADs) ranging from 4.265 to 4.907 months, demonstrating the success of machine learning in BA assessment [8].

Generally, BA assessment is affected by ethnicity, region, economic status, and nutrition. Training deep learning models on image data from various settings or patient populations may mitigate the generalization problem [9], i.e., the problem that a model trained in one setting cannot make equally accurate predictions in new settings. However, at this point, few papers have addressed the generalization of BA models by comparing single and joint data sources (institutions). Few studies have evaluated the effects of factors such as patient characteristics, radiologists, and clinical determination on AI models. Likewise, few papers on deep learning for BA assessment have addressed the Brush data: they evaluate the performance of BA models in terms of MAD values but rarely evaluate the differences between human and machine clinical determinations of BA using the Brush data.

In this study, we established three AI models: (1) the USA model (USAm), trained on the publicly available RSNA dataset; (2) the CHN model (CHNm), trained on a dataset from the National Children's Medical Centre in China; and (3) the JOI model (JOIm), trained on a mixed dataset combining the above two sources. This study aimed to evaluate AI performance in assessing BA and the effects of patient sex and age, data site differences, interpretation bias, and interobserver variability on that performance. We further assessed the agreement between AI and radiologists' clinical determinations of BA. Because AI estimation of BA is a black box [10], we used heatmaps to observe where the AI attends on the radiographs and compared this with human behavior in the corresponding medical procedures.

Methods

The workflow chart in this study is shown in Fig. 1. The steps included model design, statistics, and heatmap generation.

Fig. 1

The workflow of this study comprised three steps: deep learning model design, statistical evaluation of the performance of the AI models, and heatmap generation and explanation. Three AI models were generated using data from China (CHNm), America (USAm), and both China and America (JOIm). Two test datasets were from China (CHNt) and America (USAt). The performance of the AI models (CHNm, USAm, and JOIm) was evaluated with several metrics, including the mean absolute difference (MAD) and Bland–Altman plots. Based on the clinical determination of BA with the Brush data rule, the sensitivity and specificity of the three models in detecting abnormalities (advanced and delayed development) were calculated. The effects of sex, chronological age (CA), radiologists, and population were analyzed. Heatmaps were shown to help clarify AI decisions

Data acquisition

Our ethics committee approved this retrospective study and waived the requirement for informed consent. After excluding abnormal images and reports, 12,472 radiographs of the left hand and wrist retrieved from our children's hospital between July and September 2018 were used for our deep learning model (CHNm) and test data (CHNt). All DICOM left hand and wrist radiographs, radiology reports, radiologists' names, and the sex and chronological age (CA) of patients were exported from the Picture Archiving and Communication System. Images were labeled by BA value; BA values were extracted from the radiology reports. All radiology reports were provided by pediatric radiologists with more than 10 years of experience with reference to the paper-based Greulich-Pyle atlas (second edition) [2]. Ten senior pediatric radiologists took part in the evaluations using CHNt (n = 1246). Their years of experience in interpreting and reporting radiographs were 37 years (D1, who reviewed 524 images), 33 years (D2, 127 images), 20 years (D3, 2 images), 18 years (D4 and D5, 109 and 69 images, respectively), 16 years (D6 and D7, 60 and 214 images, respectively), 15 years (D8, 82 images), 11 years (D9, 23 images), and 10 years (D10, 36 images). To evaluate the effect of interobserver variability on CHNm-CHNt performance, disputed cases outside the 95% limits of agreement (LOAs) of the difference between the AI and reported BAs were rerated by another radiologist with 10 years of experience, and the average of the original and rerated BA values was taken as a new manual BA.

A total of 10,667 images from the RSNA dataset [7] were used for the USA model (USAm) and USA test data (USAt) after excluding images with artifacts or missing parts of the hand. We further mixed the CHNm and USAm data to implement the third AI model (JOIm). Image numbers and demographic data for all datasets are shown in Table 1.

Table 1 Summary information for three BA models with data from China (CHNm), America (USAm) and joint (JOIm), and two test datasets from China (CHNt) and America (USAt)

Data preprocessing

The first task of the preprocessing pipeline was to extract the hand bone region from the X-ray radiographs. To automatically generate the hand mask, we employed the U-Net [11] network architecture originally proposed for image segmentation. We manually annotated 200 hand bone masks using an online annotation service to form the training dataset. In the training phase, we used the Dice loss function as the optimization target for the segmentation network.
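As an illustration, the following is a minimal sketch of a soft Dice loss of the kind commonly used to train such segmentation networks (written in PyTorch; the sigmoid activation and smoothing constant are our assumptions, as the paper does not specify its exact formulation):

```python
import torch

def soft_dice_loss(pred_logits: torch.Tensor,
                   target: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss for binary hand-mask segmentation.

    pred_logits: raw network output of shape (N, 1, H, W)
    target:      binary ground-truth mask of the same shape
    """
    probs = torch.sigmoid(pred_logits)
    probs, target = probs.flatten(1), target.flatten(1)  # per-sample vectors
    intersection = (probs * target).sum(dim=1)
    denom = probs.sum(dim=1) + target.sum(dim=1)
    dice = (2.0 * intersection + eps) / (denom + eps)    # soft Dice coefficient
    return 1.0 - dice.mean()                             # minimize 1 - Dice
```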

Second, we aligned the important regions of the hand into a common coordinate space. To this end, we detected the coordinates of three specific key points of the hand. A ResNet [12] network was used as the feature extraction backbone for extracting location information. The output was 6 coordinates corresponding to three key points: the tip of the distal phalanx of the third finger, the tip of the distal phalanx of the thumb, and the center of the capitate (Fig. 2). We used the mean squared error loss function to train our landmark detection model.

Fig. 2

Key point detection model. The U-Net network architecture was employed for image segmentation, with the Dice loss function as the optimization target. ResNet was used as the feature extraction backbone. The regions of the hand were aligned into a common coordinate space. The output was 6 coordinates corresponding to three key points: the tip of the distal phalanx of the third finger, the tip of the distal phalanx of the thumb, and the center of the capitate
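A minimal sketch of such a landmark regressor follows, assuming a ResNet-18 backbone, grayscale input, and coordinates normalized to [0, 1] (the paper does not state the ResNet depth or the coordinate normalization):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class LandmarkNet(nn.Module):
    """ResNet backbone regressing 6 values: (x, y) for three key points."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Accept single-channel radiographs instead of RGB images
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        # Replace the classification head with a 6-way coordinate regressor
        backbone.fc = nn.Linear(backbone.fc.in_features, 6)
        self.net = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = LandmarkNet()
criterion = nn.MSELoss()  # mean squared error on the key point coordinates
```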

Bone age AI model

We randomly divided the preprocessed data into training, validation, and test sets at a ratio of 8:1:1. The attention module CBAM was integrated into the Inception V3 network to improve the network's feature extraction capability (Fig. 3). We then concatenated the extracted features with the patient's sex, which was encoded as 0 for male and 1 for female before being input into the network. Finally, we used two fully connected layers to regress the BA. After each convolution, batch normalization (BN) [13] and a rectified linear unit (ReLU) [14] were applied. Dropout was used in the fully connected layers at a rate of 0.5.

Fig. 3

Bone age assessment model. The attention module CBAM was integrated into the Inception V3 network. The patient's sex was encoded as 0 (male) or 1 (female). Two fully connected layers were then used to regress the BA. After each convolution, batch normalization (BN) and a rectified linear unit (ReLU) were applied. Dropout was used in the fully connected layers at a rate of 0.5
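To make the architecture concrete, below is a minimal PyTorch sketch under several stated assumptions: a simplified CBAM, attention applied to the final Inception feature map (Mixed_7c), and a hidden width of 512 in the head (none of these details are specified in the paper):

```python
import torch
import torch.nn as nn
from torchvision.models import inception_v3
from torchvision.models.feature_extraction import create_feature_extractor

class CBAM(nn.Module):
    """Simplified convolutional block attention (channel + spatial)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) +
                             self.mlp(x.amax(dim=(2, 3))))
        x = x * gate.view(n, c, 1, 1)
        # Spatial attention from channel-wise mean and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class BoneAgeNet(nn.Module):
    """Inception V3 features + CBAM + sex input + two FC layers."""

    def __init__(self):
        super().__init__()
        base = inception_v3(weights=None, aux_logits=False)
        # Feature map of the last Inception block:
        # (N, 2048, 8, 8) for a 299x299 input
        self.features = create_feature_extractor(base, {"Mixed_7c": "feat"})
        self.cbam = CBAM(2048)
        self.head = nn.Sequential(
            nn.Linear(2048 + 1, 512), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(512, 1))

    def forward(self, x: torch.Tensor, sex: torch.Tensor) -> torch.Tensor:
        f = self.cbam(self.features(x)["feat"])
        f = f.mean(dim=(2, 3))                      # global average pooling
        f = torch.cat([f, sex.view(-1, 1)], dim=1)  # append encoded sex
        return self.head(f).squeeze(1)              # predicted BA in years
```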

In the training process, we employed the mean absolute error loss function as the optimization goal for the BA regression model. The Adam optimizer [15] updated the weights with an initial learning rate of 0.01, and the learning rate was gradually decayed as the epochs increased to obtain better convergence.
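A hedged sketch of this training procedure follows, reusing the BoneAgeNet sketch above; the exponential decay schedule, batch size, epoch count, and synthetic stand-in data are our assumptions (the paper specifies only the MAE loss, the Adam optimizer, and the 0.01 initial learning rate):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data purely so the sketch runs end to end;
# grayscale radiographs would be replicated to 3 channels for this backbone
images = torch.randn(8, 3, 299, 299)
sexes = torch.randint(0, 2, (8,)).float()
bone_ages = torch.rand(8) * 18.0          # bone ages in years
loader = DataLoader(TensorDataset(images, sexes, bone_ages), batch_size=4)

model = BoneAgeNet()                       # model sketched above
criterion = nn.L1Loss()                    # mean absolute error on BA
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

for epoch in range(2):                     # a real run would use many epochs
    for x, sex, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x, sex), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                       # decay the learning rate per epoch
```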

Regional heatmap activation

We utilized the Grad-CAM [16] method to generate heatmaps to determine which parts of an image were locally significant for the fine-grained prediction, information that can rarely be gleaned from existing clinical protocols (e.g., G&P [2] and TW3 [17]). To apply Grad-CAM, we extracted the features from the last convolutional layer of the network.
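The sketch below shows one way to compute Grad-CAM for a regression output using forward and backward hooks; hooking the CBAM output of the model sketched earlier is our assumption about which layer corresponds to "the last convolution layer":

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, image, sex):
    """Grad-CAM heatmap of the regressed bone age w.r.t. a conv feature map."""
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        model.zero_grad()
        out = model(image, sex)            # predicted BA, one value per image
        out.sum().backward()               # gradients of the regressed BA
    finally:
        h1.remove()
        h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)        # channel weights
    cam = F.relu((weights * feats[0]).sum(dim=1))            # weighted sum
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)  # scale to [0, 1]
    return F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                         mode="bilinear", align_corners=False).squeeze(1)

# Example usage: heatmaps = grad_cam(model, model.cbam, images, sexes)
```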

BA clinical determination

BA values from the radiology reports and from the AI models tested on CHNt were analyzed using the Brush data to classify BA diagnoses [2]. A BA was classified as normal (within ± 2 SD of the CA), delayed (lower than −2 SD of the CA), or advanced (higher than +2 SD of the CA). The Brush data reflect the variability of BA and are widely accepted for clinical determinations of BA. We classified BA diagnoses into these three groups and performed a statistical analysis comparing human and AI estimations of BA in terms of kappa values.
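The rule reduces to a simple threshold comparison, sketched below (the SD lookup by sex and CA comes from the Brush data tables and is not reproduced here):

```python
def classify_bone_age(ba: float, ca: float, sd: float) -> str:
    """Clinical determination of BA relative to chronological age (CA).

    `sd` is the Brush-data standard deviation for the child's sex and CA
    (tabulated values; the lookup itself is not shown).
    """
    if ba > ca + 2 * sd:
        return "advanced"
    if ba < ca - 2 * sd:
        return "delayed"
    return "normal"

# Example: BA 9.5 y at CA 7.0 y with SD 1.0 y -> "advanced"
print(classify_bone_age(9.5, 7.0, 1.0))
```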

Statistical analysis

Bland–Altman plots with 95% LOAs (mean ± 1.96 SD) were created to illustrate the difference between AI estimations of BA and reported BAs. Pearson correlation analysis was used to assess the correlation between AI-determined BAs and reported BAs. The performance of each AI model was evaluated in terms of the mean, standard deviation (SD), MAD, and root mean square error (RMSE) of the differences between the AI-determined and reported BAs. The accuracy of each AI model was assessed as the percentage of cases with an absolute difference between AI and reported BAs within 0.5 years, 1 year, and 2 years.
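For clarity, these summary statistics can be computed as in the following sketch (the function name and dictionary layout are ours):

```python
import numpy as np

def agreement_metrics(ai_ba: np.ndarray, reported_ba: np.ndarray) -> dict:
    """Summary statistics of the difference between AI and reported BAs."""
    diff = ai_ba - reported_ba
    mean, sd = diff.mean(), diff.std(ddof=1)
    return {
        "mean_diff": mean,
        "sd": sd,
        "loa_95": (mean - 1.96 * sd, mean + 1.96 * sd),  # Bland–Altman limits
        "mad": np.abs(diff).mean(),
        "rmse": np.sqrt((diff ** 2).mean()),
        "within_0.5y": (np.abs(diff) <= 0.5).mean(),
        "within_1y": (np.abs(diff) <= 1.0).mean(),
        "within_2y": (np.abs(diff) <= 2.0).mean(),
        "pearson_r": np.corrcoef(ai_ba, reported_ba)[0, 1],
    }
```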

When the data were not normally distributed, nonparametric tests were used to compare two or more groups (different sexes, CAs, radiologists, and clinical determinations), i.e., the Mann–Whitney U test or the Kruskal–Wallis test, respectively. The classifications of BA diagnoses were analyzed by the chi-square test. The agreement between human and AI estimations was quantified by kappa values.
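Although we used SPSS, equivalent tests are available in common Python libraries; the sketch below uses synthetic data only to illustrate the calls (SciPy and scikit-learn are our choice of tooling):

```python
import numpy as np
from scipy.stats import mannwhitneyu, kruskal, chi2_contingency
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
ad_m = rng.exponential(0.5, 100)    # synthetic absolute differences, males
ad_f = rng.exponential(0.4, 100)    # synthetic absolute differences, females

print(mannwhitneyu(ad_m, ad_f))                      # two-group comparison
print(kruskal(ad_m[:30], ad_m[30:60], ad_m[60:]))    # multi-group comparison

table = np.array([[60, 55], [25, 30], [15, 15]])     # diagnosis counts (AI vs. report)
print(chi2_contingency(table)[:2])                   # chi-square and p value

ai = rng.integers(0, 3, 200)                         # synthetic per-case labels
md = rng.integers(0, 3, 200)
print(cohen_kappa_score(ai, md))                     # human-AI agreement
```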

The statistical analyses were performed using SPSS 17.0 (SPSS Inc.). Differences were considered significant at p < .05. Figures were drawn using SPSS and GraphPad Prism v5.0 (GraphPad Software Inc.).

Results

Performance of deep learning models

The differences between the BA estimations of the three AI models tested on CHNt and USAt and the radiologists' reported BA values are shown in Bland–Altman plots (Fig. 4a–f) with mean bias and 95% LOAs. The percentage of scattered dots outside the 95% LOA was lowest for CHNm-USAt at 4.0% (42/1060) (Fig. 4d) and highest for USAm-USAt at 6.1% (65/1060) (Fig. 4e). The span from the upper to the lower 95% LOA was narrowest for CHNm-CHNt (−1.129 to 1.058 years; Fig. 4a) and broadest for CHNm-USAt (−2.285 to 1.664 years; Fig. 4d).

Fig. 4

Bland–Altman plots showing the difference between AI and reported BAs. a CHNm-CHNt. b USAm-CHNt. c JOIm-CHNt. d CHNm-USAt. e USAm-USAt. f JOIm-USAt. The solid line represents the mean difference, and the dotted lines represent the 95% LOA. The percentage of scattered dots outside the 95% LOA was lowest for CHNm-USAt at 4.0% (42/1060) in d and highest for USAm-USAt at 6.1% (65/1060) in e. The span of the 95% LOA was narrowest for CHNm-CHNt in a (−1.129 to 1.058 years) and broadest for CHNm-USAt in d (−2.285 to 1.664 years)

Fig. 5

Box plots showing the median and quartiles of the distribution of the absolute difference (AD) between BAs from the three AI models tested on CHNt and reported BAs, grouped by (a) sex, (b) CA, (c) radiologist, and (d) BA diagnosis. p values are shown where the differences in AD values between groups were significant for a given model. The difference between sexes was analyzed using the Mann–Whitney U test. The differences among CA groups, radiologists, and BA diagnoses were analyzed using the Kruskal–Wallis test. The number of images for each factor is shown in brackets

BAs determined by AI on CHNm-CHNt, USAm-USAt, JOIm-CHNt, and JOIm-USAt (all r = 0.98) showed a stronger linear correlation with reported BAs than those determined on USAm-CHNt (r = 0.96) and CHNm-USAt (r = 0.95).

The results of internal validation (CHNm-CHNt, USAm-USAt, JOIm-CHNt, JOIm-USAt) and external validation (CHNm-USAt, USAm-CHNt) were analyzed to evaluate AI performance; in internal validation, the training and test datasets are from the same institution, whereas in external validation they are from separate institutions. Table 2 shows summary statistics for the accuracy of CHNm, USAm, and JOIm tested on CHNt and USAt. In terms of the MAD, the RMSE, and the percentage of cases with a difference between AI and reported BAs within 0.5 years, 1 year, and 2 years, CHNm-CHNt outperformed USAm-CHNt, whereas USAm-USAt outperformed CHNm-USAt. Notably, JOIm performed well on both test datasets, with results similar to those of the respective internal validations. The bias observed in external validation may arise from several factors, as we show below.

Table 2 Summary statistics of the difference between AI and reported BAs

The distributions of the absolute differences (ADs) between reported BAs and the BAs determined by the three AI models tested on CHNt, grouped by patient sex and CA, radiologist, and clinical classification of BA diagnosis, are shown as box plots (Fig. 5a-d). All AD values are reported as medians because of the non-normal distribution. In Fig. 5a and Table 2 (rows 3-5), the median AD values among females were lower than those among males on CHNm-USAt (0.66/0.83 years, p < .001) and USAm-CHNt (0.50/0.75 years, p < .001), higher than those among males on JOIm-USAt (0.45/0.31 years), and similar to those among males in the other models (p > .05). This result indicates that sex has a varied effect on the accuracy of BA estimation. Figure 5b shows that, in all three models, the AD values were large for very young CAs (2-5 years), small for middle CAs (6-14 years), and large again for the oldest CAs (15-17 years), with USAm showing larger values than CHNm and JOIm. The image numbers were very small (< 40 cases) for the youngest and oldest CA groups but relatively large for the middle CA group. In Fig. 5c, the AD values differed significantly among radiologists on USAm (p = .023) but not on CHNm or JOIm (both p > .05). The median AD values between AI and reported BAs across radiologists ranged from 0.25 to 0.42 years on CHNm, 0.42 to 0.96 years on USAm, and 0.25 to 0.79 years on JOIm. In Fig. 5d, the normal group presented a smaller AD than the advanced and delayed groups. The highest performance was observed for CHNm-CHNt in the normal group (0.39 years), and the lowest performance was observed for USAm-CHNt in the delayed group (1.02 years).

The effect of interobserver variability in disputed cases on CHNm performance was also analyzed. Sixty-nine disputed cases fell outside the 95% LOA (> 1.022 years or < −1.166 years) of the difference between the AI-determined BA on CHNm-CHNt and the reported BA (Fig. 4a) and were rerated. The SD of the difference between the rerated BA and the reported BA was 0.88 years. The new manual BA (the average of the rerated and originally reported BAs) had a mean of 10.73 years, which was closer to the mean AI-determined BA (10.70 years) than to the mean reported BA (10.59 years). The new manual BA agreed better with the CHNm-CHNt AI-determined BA in 65.2% (45/69) of the disputed cases. This finding is interesting and leads us to consider whether AI might outperform individual doctors in estimating BA.

Regional heatmap

The hot-spot values in the heatmap of each radiograph were normalized to the range 0–1. We grouped the radiographs by patient sex and age in accordance with the GP atlas and, within each group, averaged the heatmaps to better show the group's characteristics. In Fig. 6, the first row shows a typical X-ray film, the second row shows the heatmaps, the third shows the SD values, and the fourth shows the variation values.
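The per-group averaging can be sketched as follows (a minimal NumPy version; the per-image maximum scaling is our assumption about how the 0–1 transformation was done):

```python
import numpy as np

def average_heatmaps(heatmaps, sexes, ages):
    """Scale each heatmap to [0, 1], then average within (sex, age) groups.

    heatmaps: array of shape (N, H, W); sexes and ages: length-N arrays.
    Returns a dict mapping (sex, age) to the group-average heatmap.
    """
    hm = np.asarray(heatmaps, dtype=float)
    hm = hm / (hm.max(axis=(1, 2), keepdims=True) + 1e-8)  # per-image scaling
    sexes, ages = np.asarray(sexes), np.asarray(ages)
    groups = {}
    for key in {(s, a) for s, a in zip(sexes.tolist(), ages.tolist())}:
        mask = (sexes == key[0]) & (ages == key[1])
        groups[key] = hm[mask].mean(axis=0)                # group average
    return groups
```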

Fig. 6

Sample hand radiographs (row 1), heatmaps (row 2), standard deviation (SD) values (row 3), and variation value maps (row 4) for males aged 5, 8, 14, and 18 years. These attention heatmaps show the AI vision. For younger children, such as the 5-year-old male group, the heatmap focused more on the phalanges, whereas radiologists focus more on the carpals according to the GP atlas. For older children, such as the 14-year-old male group, the heatmap focused more on the carpals, whereas radiologists focus more on the metacarpals and radius

These heatmaps show the AI vision, which may help explain the AI black box. For younger children, such as the 5-year-old male group (column 1 in Fig. 6), the heatmap focused more on the phalanges and less on the carpals, whereas radiologists focus more on the carpals according to the GP atlas. For older children, such as the 14-year-old male group (column 3 in Fig. 6), the heatmap focused moderately more on the carpals, whereas radiologists focus more on the metacarpals and the radius. For the hands of 18-year-old boys, most AI hot spots appeared in the carpal area. For children in the middle age group, for example, 8-year-old boys, the AI focused more on the phalanges and moderately more on the carpals, whereas radiologists focus on the phalanges, metacarpals, and carpals. This result indicates that AI and humans may focus on different regions of the hand bones when estimating BA.

BA clinical determination

The distributions of the BA diagnoses of the 1246 test images from CHNt are shown in Table 3. As indicated by the p values, the clinical classifications of CHNm and JOIm showed no difference from the radiology reports (both p > .05), whereas those of USAm did (p < .001). At this aggregate level, CHNm and JOIm would appear to evaluate BA very well. However, when we individually matched the 1246 subjects, the kappa values were 0.714 for CHNm, 0.716 for JOIm, and 0.53 for USAm (p < .001), as shown in Table 4. These results indicate that agreement between the models and radiologists was not high.

Table 3 The clinical classifications of bone age diagnosis by using CHNt (test dataset from China, N = 1246) for three models and radiology reports
Table 4 The kappa values of clinical classifications of bone age diagnosis of CHNt (test dataset from China, N = 1246) for three models and radiology reports

Discussion

In this study, we evaluated the performance of bone age deep learning models established using our hospital's clinical data and the RSNA data. The best MAD value was 0.42 years, achieved on CHNm-CHNt. This is as accurate and precise as previous studies, in which the MADs ranged from 0.38 to 0.64 years [18,19,20,21,22,23,24]. However, the performance of the BA models was worse with external validation (CHNm-USAt and USAm-CHNt) than with internal validation (CHNm-CHNt and USAm-USAt). These results are consistent with the studies of Larson [19] and Koita [23].

An interesting finding of this study concerns the hand heatmaps. We found that the AI heatmaps were not fully consistent with the regions humans focus on according to the GP atlas. To some extent, this finding reveals which areas the AI focuses on and how, which has not been described in previous studies. We are not the first to report heatmaps for an AI BA model [21]; however, our heatmaps differ from previous findings in that they were generated over the whole hand rather than over several partial hand regions, as in the previous study.

Regarding the clinical determination of BA in the normal, advanced, and delayed groups, both our study and Larson's [19] found that the chi-square test showed no difference between AI and human clinical determinations of BA. However, further analysis of the kappa values indicated that agreement between AI and humans was not high (kappa 0.714). This result calls into question the performance of AI in BA diagnosis. Moreover, USAm tested on external data showed the worst agreement between AI and humans, with the broadest limits on the Bland–Altman plot and the lowest kappa value (0.53). This result further reflects the generalization problem that arises when AI faces external, new data.

The effects of patients, institutions, and radiologists on AI performance were also assessed. We observed several biases between AI and radiologists related to children's sex and age, institutions, and radiologists. These data biases may cause the variable performance of the three BA models and the disagreement in BA diagnosis between AI and radiologists.

As the GP atlas shows, males and females of the same age have different BA atlases. Children of the same age can show different atlas characteristics, and the same characteristics may belong to different ages. Lee et al found that higher MAD errors were seen for the female cohort in a sex-aware model, which may suggest that the relatively higher growth rate of the female cohort causes greater deviation from the nominal growth trajectory for individual subjects [25]. Our results differ from Lee's finding. We observed a significant difference in MAD values between males and females for three model-test combinations (CHNm-USAt, USAm-CHNt, and JOIm-USAt); these results are shown in Table 2 ("*" means p < .05). A lower MAD (higher AI performance) was seen for females on CHNm-USAt and USAm-CHNt and for males on JOIm-USAt, and a higher MAD (lower performance) for males on CHNm-USAt and USAm-CHNt and for females on JOIm-USAt. Thus, AI performance was not consistently associated with sex.

The worse performance of the BA models with external validation implies that an AI model does not fit well across different sites [9]. This institutional bias is due mainly to the different physical characteristics of the populations at different institutions. For example, some studies have examined the application of the GP atlas to assess bone age in children of diverse ethnicities [26] and have indicated cross-racial growth differences between Asian and White children [27]. Our results showed better validation on internal datasets but poor validation across institutions, mainly because of the different populations. Our JOIm, implemented with combined data from China and America, showed better performance. Similarly, Mutasa et al combined private and public data to build a BA model [24]; joining datasets from an increasing number of institutions may help solve the generalization problem across institutions.

Our results also showed larger differences between the AI and radiology reports in the abnormal BA groups than in the normal BA group, possibly because skeletal maturation inconsistencies between the carpals and tubular bones in disease conditions make interpreting BA with reference to the GP atlas, which was based on typical children's bone structure, more difficult. Such conditions include growth hormone deficiency [28], obesity [29], and chronic renal insufficiency [30].

Our study has three limitations. First, the population distribution in this study is a limitation and a challenge. Neither the sex distribution nor the age distribution was even. For example, the chronological ages in the CHN training dataset showed an approximately Gaussian distribution, with the 10-year-old group the largest at 18.5% of the data, and the numbers of younger and older children were lower. This age distribution would affect the accuracy of the model. Second, the labeling rule is another limitation of our study. The test data CHNt (n = 1246) were labeled by ten senior pediatric radiologists, so interobserver variability is a limitation. This variation could be decreased by averaging two or more reads. In our study, 69 disputed cases were rerated, and the average was closer to the AI value than to either individual read. This finding indicates that rerating the BA of radiographs may help improve the AI determination of BAs, a result supported by Van Rijn [3] and Mutasa [24]. Third, the clinical determination of bone age may be both a limitation and a contribution of this study. The human-machine comparison was presented not only in terms of the MAD value but also in terms of the classification of child development. However, the clinical determination of bone age here is based only on the BA value from the wrist and hand radiographs, the chronological age, and the SD values from the Brush data; more clinical information from pediatricians is needed to evaluate pediatric growth conditions.

In the future, our study will focus on assessing pediatric growth conditions, not just bone age. Because pediatric growth conditions are not assessed by bone age alone, clinical history and laboratory data are very useful. In this study, the classification of BA diagnosis was made only from the Brush data and radiology reports, which may not be accurate for assessing child development without clinical data and other physical examinations. The data science Venn diagram by Drew Conway [31] indicates a danger zone when data are mined with hacking skills and domain knowledge but without statistical expertise. Our future work will focus on collecting clinical data for BA diagnosis and on visual saliency maps that provide explicability for AI, to improve the accuracy of BA prediction.

In conclusion, the deep learning models predicted BA better under internal and joint-dataset validation than under external validation. However, the AI models' clinical determinations of bone age were not in high agreement with those of radiologists. Several factors, including patients' sex and age, institutions, and radiologists, contributed to bias in AI performance. The heatmaps of bone age were useful in clarifying how the AI made its decisions.