1 Introduction

Bone age is a quantification of the skeletal development of children. It differs from chronological age, ranging from 0 to 20 years and varying according to gender and ethnicity. Bone Age Assessment (BAA) is performed by radiologists and pediatricians to diagnose growth disorders or to determine the final adult height of a child [17]. This evaluation is commonly accomplished through visual inspection of ossification patterns in a radiograph of the non-dominant hand and wrist. The most commonly used manual methods are Greulich and Pyle (G&P) [8] and Tanner and Whitehouse (TW2) [17]. In G&P the entire hand is classified into a single stage, whereas in TW2 the hand is divided into 20 Regions Of Interest (ROIs) that are analyzed individually to estimate the patient's bone age. As a result, TW2 is more precise because it performs a local analysis of the hand. However, both manual approaches are prone to intra- and inter-observer errors due to the level of expertise of the radiologist or possible variations in the radiograph.

To improve the accuracy of BAA, there has been growing interest in the development of automated methods. The commercial software BoneXpert [18], created in 2009 by Hans Henrik Thodberg and Sven Kreiborg, is currently used in clinical settings. This tool performs BAA based on edge detection and active appearance models [2] to generate candidates and make comparisons according to G&P and TW2. However, the method was developed on patients from a Danish cohort, so its reliability is not guaranteed when assessing data from other countries.

Recently, the Radiological Society of North America (RSNA) created a BAA dataset for the 2017 Pediatric Bone Age Challenge [9]. The RSNA Challenge encouraged the development of several machine learning and deep learning approaches to accurately perform BAA [9]. The winners of the challenge achieved a Mean Absolute Difference (MAD) of 4.26 months over the test set [1]. Most top-performing methods used relatively shallow neural networks to extract features from the entire image. Other entries uniformly extracted overlapping patches or first segmented the bones to localize the analysis, obtaining 4.35 and 4.50 MAD, respectively [9]. In a related effort, [12] developed an approach that focuses on the carpal bones and the metacarpal and proximal phalanges, achieving 4.97 MAD.

Although some of the approaches above explicitly build on local information when assessing bone age, existing BAA datasets [3, 6, 7, 9] provide only bone age annotations at the image level and hence are not designed to exploit the information of anatomical ROIs. A suitable approach for identifying ROIs in the hand is hand pose estimation. This task has been studied in the context of 3D hand models directed towards human-computer interaction, virtual reality and augmented reality applications [4, 5]. In this work, we propose a 2D framework focused on radiological hand pose estimation as a new task, enabling various medical applications in this field.

We present the Radiological Hand Pose Estimation (RHPE) dataset, containing 6,288 hand radiographs from a population with different characteristics than the currently available datasets, ensuring high variability for better model generalization. In addition to the new dataset, we introduce hand detection and hand pose estimation as new tasks to extract local information from images. To establish a robust framework, we collect manually annotated keypoints for anatomical ROIs and bounding boxes for each hand radiograph. An example of our annotations is presented in Fig. 1. We also provide bounding box and keypoint annotations for the RSNA dataset. We evaluate the performance of state-of-the-art methods on our proposed tasks on both the RSNA and RHPE datasets, and we propose a new local approach to BAA, called BoNet, that significantly outperforms all existing approaches. Additionally, we show that both datasets are complementary and can be combined to create a robust benchmark with better model generalization, regardless of the population's characteristics.

Fig. 1. Different examples of our keypoint and bounding box annotations. We provide groundtruth for hand detection and hand pose estimation for both the RHPE and RSNA datasets.

Our main contributions can be summarized as follows:

  1. We introduce RHPE, a new dataset from a diverse population, and create a new benchmark for the development of BAA methods.

  2. We provide the first manually annotated bounding boxes and keypoints in the RSNA and RHPE datasets. These annotations enable a new experimental framework including hand detection and hand pose estimation as tasks for the extraction of local information from hand radiographs.

  3. We present BoNet, a novel CNN architecture that leverages anatomical ROIs to perform BAA. BoNet significantly outperforms all state-of-the-art methods on the RSNA and RHPE datasets.

To ensure reproducibility of our results and to promote further research on BAA, we provide the RHPE dataset and the corresponding annotations for train and validation, as well as the trained models, the source code for BoNet and the additional annotations created for the RSNA dataset [9]. We also deploy a server for automated evaluation of the test set (Footnote 1).

2 Radiological Hand Pose Estimation Dataset

2.1 Dataset Description

We collect the RHPE data from a population that differs from those in the currently available datasets for BAA. The database comprises radiographs of the left and right hands of both male and female patients between 0 and 240 months of age (0–20 years), with bone age annotations made by two expert radiologists for each patient. The dataset is composed of 6,288 images divided into 3 sets: 5,492 for training, 716 for validation and 80 for testing, maintaining the proportions of the splits of the RSNA dataset. 54% of the dataset corresponds to female patients and 46% to male patients, the same gender proportions as in the RSNA dataset. Furthermore, the bone age distributions of both datasets are approximately the same Gaussian, centered around 126 months of age. This similarity suggests that the datasets are compatible and can be used to study the influence of ethnicity on bone age assessment algorithms. See the supplementary material for a further analysis of the similarities and differences between both datasets.

We gather anatomical landmark annotations from 96 trained subjects. For each image, the subject is shown an example of where the keypoints should be located. We obtain multiple annotations per image and perform outlier rejection by discarding annotations that lie more than 2 standard deviations away from the mean. From this procedure, we retain at least 4 annotations per image made by different trained subjects. With 17 keypoints per hand radiograph, this accounts for more than 1.3 million annotated keypoints. These annotations correspond to the proximal, middle and distal phalanges, the carpal bones, and the distal radius and ulna. For compatibility, we store all our annotations in the MS-COCO [13] format. For the detection groundtruth, we include the upper-left coordinates, width and height of the bounding box that encloses the hand.
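
As an illustration of this aggregation step, the sketch below performs per-keypoint outlier rejection and converts the fused landmarks into the MS-COCO keypoint layout of (x, y, visibility) triplets. The array shapes, threshold handling and function names are ours and only approximate the procedure described above.

```python
import numpy as np

def aggregate_keypoints(raw, n_std=2.0):
    """Fuse multiple annotations of one image into a single groundtruth.

    raw: (n_annotators, n_keypoints, 2) array of (x, y) clicks.
    Clicks lying more than `n_std` standard deviations from the per-keypoint
    mean are discarded before averaging (a sketch of the outlier rejection).
    """
    mean = raw.mean(axis=0)                                    # (K, 2)
    std = raw.std(axis=0) + 1e-6                               # (K, 2), avoid /0
    keep = np.all(np.abs(raw - mean) <= n_std * std, axis=2)   # (A, K) mask
    fused = np.empty_like(mean)
    for k in range(raw.shape[1]):
        fused[k] = raw[keep[:, k], k].mean(axis=0)
    return fused

def to_coco_keypoints(fused, visibility=2):
    """Flatten (x, y) landmarks into the MS-COCO [x1, y1, v1, x2, y2, v2, ...] list."""
    return [v for x, y in fused for v in (float(x), float(y), visibility)]
```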

2.2 Tasks

Hand Detection. The goal of hand detection is to determine the location of a specific hand in the image. Detection is a necessary task in our dataset because the images in RHPE include both hands, and the non-dominant hand must be isolated for the assessment. For evaluation, we use the same standard metrics as in MS-COCO for object detection: mean Average Precision (mAP) and mean Average Recall (mAR) at Intersection over Union (IoU) thresholds @[.50:.05:.95].
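
As a reference, such a detection evaluation can be run with the pycocotools package, assuming the groundtruth and the detector outputs are stored as MS-COCO JSON files; the file names below are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names for groundtruth hand boxes and detector results.
coco_gt = COCO("rhpe_val_hand_boxes.json")
coco_dt = coco_gt.loadRes("faster_rcnn_val_detections.json")

# Bounding-box evaluation; mAP and mAR are averaged over IoU @[.50:.05:.95].
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP, AP@.50, AP@.75 and the AR values
```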

Hand Pose Estimation. In this task, the objective is to estimate the position of anatomical ROIs in the hand radiograph. For the evaluation of hand pose estimation algorithms, we use the mAP and mAR at Object Keypoint Similarity (OKS) [13] thresholds @[.50:.05:.95]. It is worth noting that the evaluation code used in MS-COCO only takes into account instances that were accurately detected. To obtain a more realistic assessment of performance, we modify this metric so that every image contributes to the score, regardless of the detection mAR@[.50:.05:.95]. Additionally, the standard deviation \(\sigma _{i}\) with respect to the object scale s varies significantly across keypoints. In full-sized images, our \(\sigma \) penalizes any keypoint estimate that is 10 pixels or more away from the mean location. See the supplementary material for additional information.
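
For completeness, the OKS between a predicted and a groundtruth hand pose can be computed as in the sketch below; the per-keypoint constants follow the MS-COCO convention (\(k_i = 2\sigma_i\)), and the values passed in are illustrative rather than the ones used in our evaluation.

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity between a predicted and a groundtruth pose.

    pred, gt : (K, 2) arrays of (x, y) keypoint coordinates.
    visible  : (K,) boolean mask of annotated keypoints.
    area     : object scale s**2, e.g. the area of the hand bounding box.
    k        : (K,) per-keypoint constants (k_i = 2 * sigma_i); illustrative values.
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)      # squared keypoint distances
    e = d2 / (2.0 * area * k ** 2)             # normalized by scale and sigma
    return float(np.mean(np.exp(-e[visible])))
```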

Bone Age Assessment. In the BAA task, we aim at estimating the bone age in months for a given hand radiograph. To evaluate performance, we use the same metric as in the RSNA 2017 Pediatric Bone Age Challenge: the Mean Absolute Difference (MAD) between the groundtruth values and the model's predictions. We evaluate our performance on the RSNA dataset using the challenge evaluation server.
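
The metric itself reduces to a one-line computation; a minimal sketch with illustrative variable names:

```python
import numpy as np

def mad(predictions, groundtruth):
    """Mean Absolute Difference (MAD) in months between predicted and true bone ages."""
    predictions, groundtruth = np.asarray(predictions), np.asarray(groundtruth)
    return float(np.mean(np.abs(predictions - groundtruth)))
```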

2.3 Baselines

Hand Detection. As the baseline for the hand detection task, we use Faster R-CNN [15] with an ImageNet pre-trained model and ResNet-50 [11] as the backbone. This widely used object detector consists of a network that generates and scores object proposals. We train Faster R-CNN [15] for 5,000 iterations with a constant learning rate of \(2.5\times 10^{-3}\) using the implementation released in [14].
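
A minimal sketch of an equivalent detector setup, written with torchvision rather than the implementation of [14]; the data loader is assumed to yield image lists and MS-COCO-style targets, and the optimizer settings beyond the learning rate are assumptions.

```python
import torch
from torchvision.models import ResNet50_Weights
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Two classes (background and hand); the backbone is an ImageNet pre-trained ResNet-50.
model = fasterrcnn_resnet50_fpn(weights=None,
                                weights_backbone=ResNet50_Weights.IMAGENET1K_V1,
                                num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-3,
                            momentum=0.9, weight_decay=1e-4)

model.train()
for images, targets in data_loader:          # placeholder DataLoader
    # images: list of (3, H, W) tensors; targets: list of dicts with
    # "boxes" (N, 4) in [x1, y1, x2, y2] format and "labels" (N,).
    loss_dict = model(images, targets)       # returns the individual losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```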

Hand Pose Estimation. To address hand pose estimation, we build on the recent state-of-the-art architecture proposed by Xiao et al. [19] for human pose estimation. This model consists of an encoder-decoder in which deconvolutional layers are added on top of a backbone network. We initialize the model on ImageNet with a ResNet-50 backbone [11] and train it for 20 epochs using the suggested training parameters.
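
The encoder-decoder of [19] can be sketched as a ResNet-50 feature extractor followed by three deconvolutional stages and a 1x1 convolution producing one heatmap per keypoint; the layer widths below follow common defaults for this architecture and should be read as illustrative.

```python
import torch.nn as nn
import torchvision

class SimplePoseNet(nn.Module):
    """Sketch of a Simple-Baselines-style pose network for 17 hand keypoints."""

    def __init__(self, num_keypoints=17):
        super().__init__()
        backbone = torchvision.models.resnet50(
            weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1)
        # Keep everything up to the last residual stage (2048-channel features).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])

        def deconv(cin, cout):
            # Each deconvolution stage doubles the spatial resolution.
            return nn.Sequential(
                nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True))

        self.decoder = nn.Sequential(deconv(2048, 256), deconv(256, 256),
                                     deconv(256, 256))
        self.head = nn.Conv2d(256, num_keypoints, kernel_size=1)  # one heatmap per ROI

    def forward(self, x):
        return self.head(self.decoder(self.encoder(x)))
```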

Bone Age Assessment. As the baseline for the BAA task, we re-implement the method proposed by the winners of the RSNA 2017 Pediatric Bone Age Challenge. However, our inference differs from that of [1], which used an ensemble of the best models, whereas we use a single model. This model uses an Inception-V3 [16] architecture combined with a network that encodes gender information, and adds two 1000-neuron densely connected layers to produce the final prediction. For the baseline, we train the model for 150 epochs using the Adam optimizer with an initial learning rate of \(3\times 10^{-3}\).
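
The sketch below shows one way to fuse the image and gender streams as described above; the width of the gender branch and the use of an identity head on Inception-V3 are our assumptions, not the exact implementation of [1].

```python
import torch
import torch.nn as nn
import torchvision

class BAABaseline(nn.Module):
    """Sketch of the Inception-V3 + gender baseline for bone age regression."""

    def __init__(self, gender_dim=32):
        super().__init__()
        self.cnn = torchvision.models.inception_v3(weights=None, aux_logits=False)
        self.cnn.fc = nn.Identity()                      # 2048-d image descriptor
        self.gender = nn.Sequential(nn.Linear(1, gender_dim), nn.ReLU(inplace=True))
        self.regressor = nn.Sequential(
            nn.Linear(2048 + gender_dim, 1000), nn.ReLU(inplace=True),
            nn.Linear(1000, 1000), nn.ReLU(inplace=True),
            nn.Linear(1000, 1))                          # bone age in months

    def forward(self, image, gender):
        # image: (B, 3, 299, 299) radiograph; gender: (B, 1) with 0/1 encoding.
        features = torch.cat([self.cnn(image), self.gender(gender)], dim=1)
        return self.regressor(features)
```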

3 BoNet

Inspired by the way physicians perform BAA, we introduce a method that leverages local information to accurately address this task. For this purpose, we first locate the hand and find the anatomical ROIs. Afterwards, we create attention maps by generating a Gaussian distribution around each anatomical landmark. Our network, which we call BoNet, takes as input both the X-ray image and the attention map, and extracts high-level visual features from them using two independent pathways. We then combine the features from both pathways through a mixed Inception module and follow the suggestion in [1] to include gender information. Finally, after two fully-connected layers, BoNet regresses the bone age using an L1 loss. See the supplementary material for additional information. Figure 2 illustrates an overview of the complete method. To train BoNet, we start from our BAA baseline's weights. We use the Adam optimizer with an initial learning rate of \(1\times 10^{-4}\) over 50 epochs and reduce the learning rate by a factor of 0.1 once the loss plateaus.
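
The attention map used as the second input of BoNet can be rendered as in the sketch below; the spread of the Gaussians is a free parameter here and does not correspond to the exact value used in our experiments.

```python
import numpy as np

def attention_map(landmarks, height, width, sigma=10.0):
    """Render an attention map with one Gaussian per anatomical landmark.

    landmarks: (K, 2) array of (x, y) keypoint coordinates in pixels.
    sigma: spread of each Gaussian in pixels (illustrative value).
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x, y in landmarks:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)   # keep the strongest response per pixel
    return heatmap
```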

Fig. 2. Overview of the pipeline used in BoNet. The original image goes through hand detection and hand pose estimation to identify ROIs. The cropped image, the heatmap and the gender are then used as input to BoNet for BAA.

4 Experiments

Hand Detection. For this task, we evaluate the performance of our baseline method on the RSNA and RHPE datasets. Since the RHPE images contain both left and right hands, we evaluate the detection of the left hand, statistically assuming that it is the non-dominant hand [10]. Table 1 shows the results obtained on the validation split of both datasets. The performance of Faster R-CNN, measured by the mAP and mAR at different IoU thresholds (@[.50:.05:.95]), is high. In particular, the mAP@[.75] indicates excellent localization of the detections. This behaviour is important because detecting the bounding box of the hand is the first step towards using local information for BAA; consequently, precision and recall in hand detection significantly affect BAA performance. The remaining errors in finding bounding boxes can be associated with false positives contained inside the annotated bounding box and with the low IoU of most true positives. Because the RHPE images contain both hands, the hand detection task is more complex on RHPE than on the RSNA dataset.

Table 1. Results of the hand detection task in the validation split for the RSNA and RHPE datasets using our baseline.

Hand Pose Estimation. To determine the relevance of bounding box detection for hand pose estimation, we evaluate each dataset separately under three modalities: the full radiograph, the image cropped with groundtruth bounding boxes, and the image cropped with detected bounding boxes. In the RHPE dataset, the full-image modality corresponds to the left half of the radiograph, so that only the non-dominant hand is included. The results obtained on the validation set are reported in Table 2. They show that, for both datasets, the performance of the hand pose estimation task depends considerably on the input used. The groundtruth bounding boxes establish the upper bound for this task. The performance with predicted bounding boxes is lower than with groundtruth boxes, since it depends on the results of the hand detection task. In contrast, the full image includes noise associated with the background, tags and other artifacts in the radiograph, and hence obtains the lowest precision. The low performance in the full-image setup shows that a bounding box crop of the hand is a necessary input for this task.

Table 2. Comparison of results in the validation set of RSNA [9] and RHPE datasets using our baseline for the hand pose estimation task.
Table 3. BAA results on the RSNA and RHPE test sets.

Bone Age Assessment. We design three sets of experiments to study the effect of training on different data. The first set uses only RSNA, the second uses only RHPE, and the third combines both datasets. For each set we assess the importance of local information by training on whole and cropped images. We use the training and validation splits during the training stage and evaluate our results on the test split. The results in Table 3 demonstrate that hand detection is beneficial for accurate bone age assessment. Additionally, we observe that BoNet effectively leverages local information, achieving a significant improvement in performance over our full-image baseline, which re-implements the state of the art. We also find that combining both datasets during training produces better results than training on a single dataset. These results indicate that increasing and diversifying the data is beneficial for model generalization. Regarding time complexity, the final model using BoNet and cropped images takes 0.079 s per image at inference, making it suitable for a future real-time implementation.

5 Conclusions

We introduce the Radiological Hand Pose Estimation dataset as a benchmark for the development of robust methods for BAA, hand detection and hand pose estimation in radiological images, as a way of exploiting local information in the same manner as physicians do in current clinical practice. For each task, we propose an experimental framework and validate state-of-the-art methods as baselines. Our results show that the use of local information is beneficial for BAA. We also develop BoNet, a new method that exploits local information and outperforms state-of-the-art methods that rely only on global information. The RHPE dataset and its associated resources will push the envelope further in the development of robust automated BAA methods with better generalization, regardless of the population's characteristics.