Introduction

Age estimation in living individuals is a crucial issue in forensic science, competitive sports, the protection of children’s rights, and cases of illegal immigration. The hand skeleton is one of the most reliable indicators for estimating age in children up to 14–15 years [1]. After hand maturation, the medial clavicle epiphysis (MCE) is considered the optimal choice for age estimation due to its ongoing development, strong correlation with age, and easy accessibility. Previous studies have demonstrated that MCE development continues until around 30 years of age [1,2,3]. Thin-slice CT is recommended and commonly used to observe the developmental stages of MCE, as it provides clear visualization without overlapping structures [4].

Diverse qualitative and quantitative methods have been proposed for age estimation. Among them, Schmeling et al.’s staging method [5] and Kellinghaus et al.’s substage method [6] are the most widely used and have been validated in regions such as Europe, Asia, and Africa. Except for the linear regression used by Shedge et al. [7], most studies have employed general descriptive analysis to investigate the age distribution across different stages for MCE-based age estimation. The minimum age principle, which is based on descriptive statistics, is used for age classification [8]. The minimum age is the youngest age observed for each stage in the reference study. If the minimum age of a stage is over the 18-year threshold, the occurrence of this stage indicates that the individual is older than 18 years. Most studies [3, 7, 9, 10] have shown that individuals at stage 3c and above were all over 18 years old, highlighting the crucial role of MCE in classifying the 18-year threshold. However, descriptive analysis is susceptible to bias from the sample distribution [11].
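The minimum age principle described above can be illustrated with a minimal sketch. The reference data below are hypothetical and are not taken from any cited study; the function names are likewise illustrative.

```python
from collections import defaultdict


def minimum_age_per_stage(reference):
    """Return the youngest observed age for each stage in a reference sample."""
    youngest = defaultdict(lambda: float("inf"))
    for stage, age in reference:
        youngest[stage] = min(youngest[stage], age)
    return dict(youngest)


def stages_over_threshold(reference, threshold=18.0):
    """List the stages whose minimum age lies over the age threshold."""
    minima = minimum_age_per_stage(reference)
    return sorted(stage for stage, age in minima.items() if age > threshold)


# Hypothetical (stage, age in years) pairs, for illustration only.
reference = [("3b", 17.2), ("3b", 19.5), ("3c", 19.1), ("3c", 22.4), ("4", 21.0)]
print(stages_over_threshold(reference))
```

Because the result depends entirely on the youngest individual observed at each stage, the outcome is sensitive to how the reference sample was drawn, which is the bias noted above.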

Recently, machine learning (ML) has emerged as a powerful tool for age estimation. ML approaches can establish nonlinear models with strong predictive ability [12]. The support vector machine (SVM) is the most widely used ML algorithm and has shown excellent performance in age estimation and classification [11, 13,14,15]. Based on handcrafted features, the SVM algorithm can overcome the limitation of linear regression in handling nonlinear relationships between age and bone features, thus enabling the exploration of complex nonlinear patterns [11].

Despite the advancements in machine learning, handcrafted features are still a crucial component of ML models. The shape variation of MCE, which cannot be analyzed manually, may affect the accuracy of age estimation. Deep learning (DL) can automatically identify and extract multi-level imaging features and achieve end-to-end learning. Several DL architectures, such as U-Net, ResNet, and VGG, have been successfully applied in the field of age estimation. To date, only a few studies have used DL to analyze the medial clavicle [16, 17], with the primary objectives of classifying adult age from clavicle X-rays and localizing the MCE on CT, and the reported accuracy of age classification was low. To our knowledge, no artificial intelligence study has performed automated clavicle feature extraction for age estimation.

This study aimed to explore the potential of ML to enhance age estimation based on the manual staging method and to develop several DL models for automated age estimation. The performance of the models in age estimation and classification was compared and discussed in detail.

Materials and methods

Data acquisition

This retrospective study initially comprised 1049 patients aged 14.00–29.99 years who underwent chest CT scans as part of clinical routine at the West China Hospital of Sichuan University between October 2016 and June 2022. Patients who underwent chest CT examination in a supine position with raised arms were included in this study, as recommended by Tozakidou et al. [18]. Patients who showed any pathology of the clavicle (e.g., fracture, infection, surgical fixation, or tumor) were excluded. Additionally, patients with malignant tumors, developmental disorders, or endocrine disease, or those receiving drugs or treatment affecting skeletal development, were also excluded. Sex, date of birth, and date of examination were collected from the medical database. Age was determined by the following formula: chronological age = (date of examination − date of birth) / 365.25, rounded to two decimal places. Figure 1 shows the data acquisition flowchart. This retrospective study was approved by the Ethics Committee of Sichuan University, and informed consent was waived because of its retrospective nature. All methods were carried out following the relevant guidelines and regulations.
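The age formula above can be written as a short sketch. The function name and the example dates are illustrative assumptions; only the division by 365.25 and the rounding to two decimal places come from the text.

```python
from datetime import date


def chronological_age(date_of_birth: date, date_of_examination: date) -> float:
    """Chronological age in years: elapsed days / 365.25, rounded to 2 decimals."""
    elapsed_days = (date_of_examination - date_of_birth).days
    return round(elapsed_days / 365.25, 2)


# Hypothetical example dates, not patient data.
print(chronological_age(date(2000, 1, 1), date(2018, 1, 1)))
```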

Fig. 1 Flowchart of data acquisition and division for model training and testing

The CT scans were acquired using different scanners, including the uCT780 (United Imaging Healthcare, Shanghai, China), the SIEMENS SOMATOM Definition (Siemens, Forchheim, Germany), the Philips Brilliance (Philips, Amsterdam, Netherlands), and the GE Medical System Revolution (GE Healthcare, Milwaukee, WI, USA). All CT scans had a slice thickness of 1.00 mm, a window width of 2600 HU, and a window level of 800 HU. The CT slices containing MCE were extracted manually and stored.

Taking into account the age distribution and the MCE type (normal or variant), male and female samples were randomly assigned to normal training and test sets and to variant training and test sets. An additional 53 independent patients with clavicle variation were also collected and incorporated into the variant test set.

Manual methods

Only normal MCEs without shape variation were analyzed in this part. After multiplanar reconstruction, the axial and coronal views of MCE were examined to evaluate the degree of MCE development using the standard staging procedures (Fig. 2) introduced by Schmeling et al. [5] and Kellinghaus et al. [6]. Both sides of the MCE were analyzed. A subset of 50 samples was randomly selected to evaluate the intra- and inter-observer agreement using weighted kappa analysis. For intra-observer agreement, the subset was reevaluated after one month by the first forensic examiner, who has 6 years of experience in forensic age estimation from radiographs. For inter-observer agreement, the subset was evaluated by the second forensic examiner, who has 4 years of experience in this field. The machine learning models were trained and tested based on the staging results of the first forensic examiner. The stages were encoded as integer values from 1 to 9. The stages of both sides were used as independent variables to develop the linear and SVM models for age estimation.
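As an illustration of the agreement analysis, the following is a minimal sketch of a weighted kappa computed on stages encoded as integers 1–9. The text does not state which weighting scheme was used, so linear disagreement weights are an assumption here, and the ratings are hypothetical.

```python
def weighted_kappa(ratings_a, ratings_b, n_categories=9):
    """Linearly weighted kappa for two raters using integer codes 1..n_categories."""
    n = len(ratings_a)
    # Observed joint proportions of (rater A stage, rater B stage).
    observed = [[0.0] * n_categories for _ in range(n_categories)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a - 1][b - 1] += 1.0 / n
    row = [sum(observed[i]) for i in range(n_categories)]
    col = [sum(observed[i][j] for i in range(n_categories)) for j in range(n_categories)]

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            weight = abs(i - j) / (n_categories - 1)  # assumed linear weights
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]
    if expected_disagreement == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical stage codes from two readings of the same nine clavicles.
first_reading = [1, 2, 3, 4, 5, 6, 7, 8, 9]
second_reading = [1, 2, 3, 4, 5, 6, 7, 8, 8]
print(round(weighted_kappa(first_reading, second_reading), 3))
```

Weighting penalizes a disagreement of several stages more than a disagreement between adjacent stages, which suits an ordinal staging scale.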

Fig. 2 Schematic drawings and CT images of the development stages of medial clavicle epiphysis

DL models training

Since the CT images were acquired from various device manufacturers, all images underwent intensity normalization to mitigate discrepancies. The CT slices containing MCE were subjected to filtering and standardized to 20 slices automatically by using the nearest neighbor interpolation algorithm, ensuring consistency in the number of input images for the network (Fig. 3a).
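One way to standardize a variable number of slices to 20 by nearest-neighbor interpolation is sketched below. The exact procedure is not described in the text, so the index mapping, the function name, and the half-up rounding are assumptions.

```python
def resample_slice_indices(n_slices, target=20):
    """Pick `target` source slice indices by nearest-neighbor sampling along the slice axis."""
    if n_slices < 1 or target < 2:
        raise ValueError("need at least one source slice and a target of two or more")
    step = (n_slices - 1) / (target - 1)
    # Map evenly spaced positions to the nearest existing slice (round half up).
    return [int(k * step + 0.5) for k in range(target)]


# A stack with fewer than 20 slices repeats some slices; a longer stack skips some.
print(resample_slice_indices(7))
print(resample_slice_indices(39))
```

The returned indices can then be used to select slices from the original stack, so every sample enters the network with the same depth.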

Fig. 3 Overview of the automatic age estimation framework

Since the MCE is relatively small in chest CT scans, automatic segmentation of the MCE was performed to enhance the learning ability of the network. A total of 600 CT slices from 30 samples were used to train a U-Net to segment the MCE automatically (Fig. 3b). Finally, the images were resized to 512×512.

In this study, we trained and compared six DL models, including five traditional CNN models (VGG-19, ResNet-18, ResNet-34, SENet-18, and SKNet) and a hybrid neural network, CoAtNet [19]. The DL network training process is shown in Fig. 3c. Five-fold cross-validation was performed, and the datasets were randomly split into training and validation sets in a 4:1 ratio for each fold. All the DL models were trained in the PyCharm environment using an Nvidia GeForce RTX 2080 8GB GPU. The networks were trained with a batch size of two and optimized with the AdamW optimizer, with the weight decay parameter set to 0.3. The initial learning rate was set to 0.0005 and was attenuated by a factor of 0.3 after every 30 training cycles, for a total of 60 training cycles.
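The learning-rate schedule described above can be sketched as a step decay. This assumes that "attenuated by a factor of 0.3" means the rate is multiplied by 0.3 and that training cycles are counted from zero; neither detail is stated in the text, and the function name is illustrative.

```python
def learning_rate(cycle, initial_lr=0.0005, factor=0.3, step=30):
    """Step decay: multiply the initial rate by `factor` once per `step` completed cycles."""
    return initial_lr * factor ** (cycle // step)


# Rates at the start, just before the decay, and after it, over 60 training cycles.
for cycle in (0, 29, 30, 59):
    print(cycle, learning_rate(cycle))
```

In a framework such as PyTorch, a schedule of this shape is typically obtained with a step scheduler (for example `torch.optim.lr_scheduler.StepLR` with `step_size=30` and `gamma=0.3`), although the framework used in the study is not named.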

Statistical analysis

The performance of all methods was evaluated on the same independent test set. The best DL model was used for the patients with shape variation of MCE. The Pearson correlation coefficient (r) between estimated age and chronological age was analyzed. The mean absolute error ± standard deviation (MAE±SD), mean error (ME), and root mean square error (RMSE) were compared.

$$MAE=\frac{1}{n}\sum_{i=1}^n\mid {EA}_i-{CA}_i\mid \tag{1}$$
$$ME=\frac{1}{n}\sum_{i=1}^n\left({EA}_i-{CA}_i\right) \tag{2}$$
$$RMSE=\sqrt{\frac{1}{n}{\sum}_{i=1}^n{\left({EA}_i-{CA}_i\right)}^2} \tag{3}$$

where EA is the estimated age, CA is the chronological age, and n is the number of samples.
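The three error measures can be computed directly from their definitions, as in the following minimal sketch. The function name and the example ages are hypothetical.

```python
import math


def estimation_errors(estimated, chronological):
    """Return (MAE, ME, RMSE) for paired estimated and chronological ages."""
    diffs = [ea - ca for ea, ca in zip(estimated, chronological)]
    n = len(diffs)
    mae = sum(abs(d) for d in diffs) / n
    me = sum(diffs) / n  # positive values indicate overestimation on average
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return mae, me, rmse


# Hypothetical ages in years, for illustration only.
mae, me, rmse = estimation_errors([20.0, 22.0, 25.0], [19.0, 24.0, 25.0])
print(mae, me, rmse)
```

MAE and RMSE measure the size of the errors, whereas the sign of ME shows whether a model tends to overestimate or underestimate age.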

The accuracy, sensitivity, and specificity of age classification were evaluated for the age thresholds of 16, 18, 20, and 22 years (the important legal age limits in China’s criminal law and Civil Code) and the age threshold of 21 years (which is of exceptional relevance internationally). The receiver operating characteristic (ROC) curve was constructed, and the corresponding area under the curve (AUC) was obtained to evaluate the classification performance of the models.

$$\textrm{accuracy}=\frac{TP+ TN}{TP+ TN+ FP+ FN} \tag{4}$$
$$\textrm{sensitivity}=\frac{TP}{TP+ FN} \tag{5}$$
$$\textrm{specificity}=\frac{TN}{TN+ FP} \tag{6}$$

where TP (true positive) is the number of samples correctly classified as reaching or exceeding a specific legal age, TN (true negative) is the number of samples correctly classified as being below that age, FP (false positive) is the number of samples falsely classified as reaching or exceeding that age, and FN (false negative) is the number of samples falsely classified as being below that age. Accuracy reflects the classification accuracy in all samples. Sensitivity reflects the classification accuracy in individuals aged ≥ X years. Specificity reflects the classification accuracy in individuals aged < X years. X denotes the age thresholds of 16, 18, 20, 21, and 22 years.
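The following sketch shows how these quantities could be computed for one threshold X. It assumes that an individual is classified as aged ≥ X when the estimated age reaches X; the text does not state the decision rule, and the ages below are hypothetical.

```python
def classification_metrics(estimated, chronological, threshold):
    """Return (accuracy, sensitivity, specificity) for one age threshold."""
    tp = tn = fp = fn = 0
    for ea, ca in zip(estimated, chronological):
        predicted_over = ea >= threshold  # assumed decision rule
        truly_over = ca >= threshold
        if predicted_over and truly_over:
            tp += 1
        elif not predicted_over and not truly_over:
            tn += 1
        elif predicted_over:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical estimated and chronological ages in years.
estimated = [17.5, 18.2, 19.0, 16.0]
chronological = [18.1, 17.9, 20.0, 15.0]
print(classification_metrics(estimated, chronological, threshold=18))
```

When few individuals fall below a threshold, the denominator TN + FP is small, so specificity rests on very few cases and becomes unstable.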

Observer agreement was analyzed using Cohen’s kappa coefficients. The Wilcoxon test was employed to determine differences between the left and right MCE. Sex differences were compared using the Mann-Whitney U test. The statistical analysis was performed using IBM SPSS, version 22.0, and GraphPad Prism, version 8.0. Significance was set at the 5% level.

Results

Dataset characteristics

The detailed characteristics of patients are listed in Table 1. The mean age and standard deviation (mean ± SD) of the 1049 patients was 22.50±4.34 years, comprising 500 males (mean ± SD: 22.37±4.24 years) and 549 females (mean ± SD: 22.62±4.43 years). Among them, 109 samples with clavicle variation (mean ± SD: 22.81±4.61 years) were excluded from the manual analysis but included in the DL analysis. The mean age of the additional 53 variant patients was 24.63±4.08 years.

Table 1 Frequency distribution of training and test samples by age and sex

Manual staging results

The manual method yielded a κ value of 0.879 (95% confidence interval: 0.762–0.975) for intra-observer agreement and 0.855 (95% confidence interval: 0.736–0.950) for inter-observer agreement, indicating good consistency within and between observers. Side differences were observed in 135 of the 940 normal MCE samples, and the Wilcoxon test revealed a significant difference in the development of bilateral MCE (p=0.006). The Mann-Whitney U test indicated no statistically significant differences between males and females, except at stage 4 (p=0.004) (Table 2).

Table 2 Descriptive statistics for age by sex of all ossification stages expressed in years

Table 2 and Figure E1 show the age distribution of developmental stages by sex. The results revealed a significant positive association between the developmental stages of the MCE and chronological age. The youngest age of individuals with stage 3c and above was older than 18 years old in both sexes, indicating that 3c and above can aid in determining the 18-year threshold.

Estimation performance of deep learning versus manual method

Based on the manual stages, a traditional linear model was constructed for age estimation (age = 12.958 + 0.834 × right stage + 0.768 × left stage). An SVM model (kernel=‘poly’, C=1, degree=2) was also built for age estimation, with the hyper-parameters optimized through grid search. The performance of the linear, SVM, and DL models for age estimation was evaluated on the normal test set, as shown in Table 3. The results indicate that the SVM model, based on manual feature extraction, can improve accuracy compared to the traditional linear model. Among all DL models, SENet-18 demonstrated the highest accuracy, which was closest to that of the SVM model. Attention heatmaps generated via Grad-CAM were used to visualize the region of interest (ROI) of the trained SENet-18 (Figure E2). The heatmaps revealed that the network attended not only to the medial clavicle end but also, and more strongly, to the cortical bone region of the clavicle.
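As a worked example, the linear model can be evaluated for a given pair of stage codes. The coefficients are those reported above; the function name and the input codes are illustrative, and the inputs are assumed to be the integer stage values from 1 to 9 used for the manual staging.

```python
def linear_age_estimate(right_stage, left_stage):
    """Estimated age in years from the reported linear model on bilateral stage codes."""
    return 12.958 + 0.834 * right_stage + 0.768 * left_stage


# Hypothetical case: both sides assigned the stage code 7.
print(round(linear_age_estimate(7, 7), 3))
```

Because each one-step increase in stage adds a fixed amount to the estimate, the model cannot represent unequal age gaps between adjacent stages.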

Table 3 The performance of different models for age estimation in the normal test set

The heatmap of estimation errors (Fig. 4) indicates that the phenomenon of ‘attraction of the middle’ was not prominent in this study. Only a few age groups at either end exhibited obvious overestimation or underestimation, with overestimation primarily observed among those aged 27 years or older. Additionally, it is visually apparent from the heatmap that most samples had an estimation error within ±2 years.

Fig. 4 The heatmap depicting estimation errors of all eight models in the normal test set. The color of each cell represents the error of a single sample (Y-axis) predicted by a specific model (X-axis). Red indicates overestimation and blue indicates underestimation. Darker colors indicate larger estimation errors

Classification performance of deep learning versus manual method

Table 4 and Figure 5 demonstrate that clavicle CT has limited value in the classification of the 16-year threshold due to its poor specificity in this study. This lower specificity may be caused by the limited number of samples under 16 years old. In contrast, clavicle CT exhibits superior performance in the classification of the 18-, 20-, 21-, and 22-year thresholds. Comparable age classification performance was observed across all models. The confusion matrices of age classification in different age groups are shown in Figure E3, and the ROC curves are shown in Figure E4.

Table 4 The performance of different models for age classification in the normal test set
Fig. 5 Radar chart of classification accuracy of all eight models in the normal test set. Each axis arranged radially represents a model. Orange represents sensitivity. Blue represents accuracy. Green represents specificity

Performance of deep learning in variation test

Since the manual method was incapable of analyzing variant clavicles, DL models were trained with such variants to evaluate their ability on variant clavicles (Fig. 6). The models were also tested on an external variant test set including 76 samples with variant clavicles. For SENet-18, r was 0.751, MAE was 2.08±1.62 years, RMSE was 2.63 years, and ME was −0.33 years in age estimation. Although the accuracy was lower than that for normal samples, the results indicate that DL models can be used to analyze variant samples. The performance of age classification is shown in Table 5 and Figure E5. Despite the limited number and uneven distribution of the variant clavicles, the age classification accuracy in those samples remained relatively high.

Fig. 6 Examples of variant clavicles and their estimation error. a–c Medial metaphyseal concavities; d–e depressed metaphysis; f epiphyseal cartilage ossified before the ossification center was completely ossified. CA, chronological age; EA, estimated age

Table 5 The results of age classification on the 76 variant test samples

Discussion

The medial clavicle epiphysis is a crucial indicator for age estimation in late adolescence and early adulthood, particularly in the determination of the 18-year threshold. Consistent with previous studies [3, 7, 9, 10], this study confirms that stage 3c can be used to determine the 18-year threshold. Additionally, we employed ML and DL techniques to establish age estimation models that outperform the traditional linear model.

The staging methods proposed by Schmeling et al. [5] and Kellinghaus et al. [6] are commonly used for age estimation from MCE. Thin-slice CT of 1 mm or less shows the epiphyseal line more clearly and avoids visual deception caused by the partial volume effect [20]. Consistent with previous studies, this study found that stage 3c and above may serve as a reliable indicator for determining the 18-year threshold (Table E1) [1, 3, 6, 7, 9, 10]. Previous research has indicated that stage 5 typically emerges around age 25 in both sexes [7, 10, 20, 21]. Similarly, we found that the earliest observation of stage 5 was at age 26 in females and age 24 in males. However, this study found only 10 samples exhibiting stage 5, indicating that the epiphyseal line may disappear after the age of 30 when using thin-slice CT imaging.

Previous studies have generally found nonsignificant differences between the two sides of MCE [9, 21,22,23], but our study revealed a significant difference as determined by the Wilcoxon test. This discrepancy may be attributed to the shape variation of MCE, ethnic variation, and the quantitative measures of substages. The difference between males and females was statistically insignificant except for stage 4 in this study, which is consistent with previous studies [1, 6, 7, 20] reporting that most stages exhibit no sex differences. Conversely, some studies have identified significant differences between the sexes. Because the difference between sexes was insignificant and the difference between bilateral stages was significant, the stages of both sides were used for regression analysis. Males and females were analyzed together to ensure an adequate sample size for DL analysis. De Tobel et al. [24] also suggested incorporating both clavicles in the age estimation model.

The statistical analysis and the presentation of the conclusion are fundamental parts of age estimation [25]. Previous studies have used general descriptive analysis to estimate age and determine the 18-year threshold. However, this method is susceptible to mimicking the age structure of the reference population [11]. Shedge et al. [7] trained a linear regression model for age estimation from MCE on 350 samples and tested it on 50 samples with an MAE of 1.32 years. Due to variations in samples and CT machines across different studies, it is impossible to fairly compare the accuracy of age estimation. In this study, the SVM model outperformed the linear model (p = 0.0004), with an MAE of 1.73 years on the same dataset. The lower performance of the linear model may be attributed to its unsuitability for ordinal variables and its inability to accurately depict the relationship between age and epiphyseal development [11]. The possible reasons for the slightly lower accuracy compared to Shedge et al. are as follows. First, the sample size in this study is nearly three times that of Shedge et al. [7], allowing for a more comprehensive analysis of MCE development but introducing more uncertain factors. Second, a slice thickness of 1.0 mm was used, which may be less adequate for displaying MCE than the 0.6 mm used by Shedge et al.

Although SVM can improve the accuracy of age estimation, traditional ML models are still based on manual staging, which is subjective, time-consuming, and complex. Deep learning can automatically extract and integrate multiple age-related features from multi-slice clavicle CT scans. Deep learning has been applied to age estimation for years and has achieved good results [26,27,28,29,30]. Previous automated bone age estimation mainly focused on the wrist, knee, and pelvis. Due to the relatively small size of the clavicle in chest CT, automatic age estimation from MCE requires segmentation to enable the DL model to focus on MCE and disregard extraneous noise. The only previous DL investigation of clavicle CT involved the automated localization of the MCE, achieving an accuracy of 88% in the test set [16]. Going beyond automated localization alone, our study implemented automated segmentation and age estimation.

Both traditional CNNs and a hybrid model, CoAtNet, were chosen as the backbones for automated age estimation. The results show that the traditional CNN models outperformed CoAtNet. Regrettably, CoAtNet did not achieve the performance that previous studies suggested [19, 31]. One possible reason is that the large capacity of CoAtNet requires a substantial amount of image data to capture global relationships. However, this study had only slightly over a thousand samples, which may be insufficient to fully optimize the model parameters and thus to train the model effectively. In contrast, CNNs have relatively modest data requirements and concentrate on local features, making them more appropriate for age estimation from medical images. This study is the first application of DL techniques to end-to-end age estimation from clavicle CT scans, with accuracy levels comparable to those obtained through traditional ML methods. In addition, the attention heatmaps revealed a higher focus on the cortical bone region of the clavicle. In the future, deep learning-based age estimation using either MCE or cortical bone alone could be explored to assess their respective contributions.

Variant MCE has been observed in many studies, with approximately 10–20% of samples exhibiting non-assessable shape variants [20, 21, 32,33,34,35]. In this study, we identified 109 samples of variant MCE (10.39%) that cannot be analyzed using the current staging method. This normal morphological variation poses a bottleneck in accurately evaluating MCE for age estimation using the traditional manual staging method. Morphological variants were excluded from the age analysis in previous studies [24]. In this study, automated age estimation was also applied to the variant clavicles, and DL handled such cases effectively, albeit with a slightly lower accuracy compared to normal samples.

This study has several limitations. First, the age distribution is unbalanced due to the retrospective nature of the study. Future large-scale studies with different ethnicities are still needed, and more advanced methods such as EasyEnsemble or deep learning techniques are expected to be applied for data balancing. Additionally, it is recommended to gather more diverse MCE data for training DL networks to improve their performance on variant samples. Second, CT was used to image the MCE in this study, which carries a health risk from ionizing radiation exposure. In recent years, some studies have been carried out on age estimation from clavicular MRI [34, 36, 37]. The automated analysis of clavicle CT is also expected to be transferred to clavicle MRI in the future. Third, this study represents the first attempt at automated age estimation from clavicular CT scans. To achieve accurate results, a comprehensive approach that incorporates multiple anatomical sites is necessary. Previous studies have combined MRI of the MCE, wrist, and teeth to estimate age and have found that using all three sites produces better results than relying on a single site [38].

In conclusion, this study objectively evaluated and compared the performance of the linear model, ML, and DL for age estimation. The results demonstrate a significant improvement in accuracy using ML and DL. The proposed models may serve as a robust computer-assisted tool to aid experts in age estimation.