Introduction

Age estimation of individuals plays an important role in forensic contexts, including identification of the deceased, assessment of immigrants, and legal and criminal proceedings, as well as in paediatrics for clinical diagnosis and treatment planning. Age classification is often requested by authorities, institutions, or courts when they need to determine whether an individual should be treated as a child, juvenile, or adult [1]. Hagen et al. [2] reviewed 597 expert reports to demonstrate the outcome of forensic age assessments of migrants with doubtful minority declarations and found that minors accounted for 37.8% of non-criminal cases and that 74.5% of unaccompanied minors had reached the age of majority. Misclassification results in either the loss of protection for a minor or impunity for an adult [3]. The German Social Code VIII (SGB VIII) stipulates that youth welfare offices should determine whether an unaccompanied foreign national is a minor through professional visual inspection or medical examination [4]. Although the relevant ages differ between countries, the legal age for criminal liability lies between 14 and 21 years [5,6,7]. China’s criminal law regards children younger than 14 as not having full criminal responsibility, teenagers 14–16 years of age as having limited or partial criminal responsibility, and teenagers older than 16 as having full criminal responsibility. In Chinese civil law, individuals over 18 years of age have full civil responsibility [8, 9]. In an ethically sensitive scenario, it is technically unacceptable if a minor is misclassified as an adult. Since bone development completes early and is more influenced by environmental and nutritional factors than dental development, dental age estimation is an alternative choice for determining whether an individual is a minor or an adult. Mincer et al. [10], Cameriere et al. [11], and Olze et al. [12] successively established dental age estimation methods through different systems, i.e. the Demirjian staging system, the third molar maturity index (I3M), and the four-stage classification of the periodontal ligament. For classification at the legal age thresholds of 14, 16, or 18 years, Cameriere et al. [13, 14] also reported that cutoff values of the I3M and the second molar maturity index (I2M) were efficient indicators for making this distinction.

Although the methods mentioned above have been applied accurately in different populations, they still have limitations in clinical application, such as the subjectivity and limited reproducibility of the technique and measurement bias. Additionally, these methods are labour-intensive and time-consuming. Recently, deep learning, especially convolutional neural networks (CNNs), has become increasingly popular in medicine. The need for these advanced methods in dentistry is also on the rise, with applications including the detection and classification of individual teeth [15, 16], dental age estimation [17, 18] and classification [19], the detection of apical lesions [20] and periodontal disease [21], and the diagnosis of osteoporosis [22]. In 2021, Galibourg et al. [23] compared the Demirjian and Willems methods with machine learning regression models for age estimation and found that all the machine learning models based on the developmental stages defined by Demirjian were more accurate in dental age estimation. As for age classification between minors and adults, Stern et al. [19] first reported a classifier derived using a random forest algorithm and a deep CNN based on magnetic resonance imaging (MRI) of the third molar, with an accuracy of just 85.09%. Deep learning has achieved some success in many fields of medicine; however, previous studies have explored only the performance of deep learning models without a comparison with manual methods, and the sample sizes used for those models were small. In addition, the reasons for the better performance of deep learning techniques were not clarified. Therefore, it is still necessary to test whether deep learning techniques can replace humans in some circumstances and to explore whether they can find other regularities and features that have not been previously identified.

This study constitutes the first report of CNNs for dental age classification based on a large sample of orthopantomograms (OPGs; n = 10,257) and explores the difference between the features defined by humans and the features extracted by machines. We compared the traditional manual method based on the Demirjian method with CNN models for legal age threshold classification. In addition, gradient-weighted class activation mapping (Grad-CAM) was used to assess the differences in the features of interest between human beings and machines. Our results indicate that CNN models can surpass humans in age classification and that the features extracted by machines may differ from those defined by humans.

Materials and methods

Data acquisition

A total of 10,257 OPGs from 4,579 males and 5,678 females aged between 5.00 and 24.00 years were analysed. The study group was divided into 20 age groups by year of age. For example, age group 15 comprised patients aged 15.00 to 15.99 years. The OPGs (Sirona Dental Systems GmbH, Germany. Model-No.: D3352; Software-Version: 02.28; Hardware-Version: BA) were obtained from the records database of the Department of Oral Radiology, Stomatology Hospital of Xi’an Jiaotong University, between 2015 and 2019. All the OPGs were saved in JPG format. Ethical approval was granted by the ethics committee of the Stomatology Hospital of Xi’an Jiaotong University (No. 2017-524). The inclusion criteria were as follows: (1) northern Chinese origin; (2) normal development and dental conditions; (3) no missing permanent teeth in the left mandible; (4) no pathologic condition affecting the maturity of teeth; and (5) no image distortion or lack of clarity. The distribution by age and gender is shown in Table 1. The records consisted of the gender, date of birth, date of exposure, and identification number.
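The assignment of a record to a whole-year age group can be illustrated with a minimal sketch. It assumes that age is the interval between the recorded date of birth and date of exposure and that a year is taken as 365.25 days; the function names and the day-count convention are illustrative assumptions, not details reported in the study.

```python
from datetime import date


def chronological_age(birth: date, exposure: date) -> float:
    """Age in decimal years at the date of the radiograph."""
    # Assumption: one year = 365.25 days (convention not stated in the source).
    return (exposure - birth).days / 365.25


def age_group(age_years: float) -> int:
    """Whole-year age group, e.g. ages 15.00 to 15.99 fall in group 15."""
    return int(age_years)


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    age = chronological_age(date(2002, 3, 1), date(2017, 9, 1))
    print(round(age, 2), age_group(age))
```

With a fixed 365.25-day year, an exposure taken exactly on a birthday can fall a fraction of a day short of the whole-year boundary, so a calendar-based comparison may be preferable when ages close to a legal threshold matter.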

Table 1 The distribution of the sample by age and gender

In the data division of the three experiments, we randomly selected 10% of each category as the validation set, 10% of each category as the test set, and the remaining data as the training set. The training set was used for the establishment of manual methods and the training of pre-proposed deep learning models. The validation set was only used for determining the optimal deep learning models, while the test set was for determining the classification performance of both manual and machine methods.
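A per-category 80/10/10 division of this kind can be sketched as follows. The function name, the rounding of the subset sizes, and the random seed are assumptions made for illustration; the study does not report how the random selection was implemented.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.10, test_frac=0.10, seed=0):
    """Return (train, val, test) index lists, sampling each category separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # hypothetical seed; none is reported in the source
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_frac)
        n_test = round(len(idxs) * test_frac)
        val.extend(idxs[:n_val])                 # 10% of this category
        test.extend(idxs[n_val:n_val + n_test])  # another 10% of this category
        train.extend(idxs[n_val + n_test:])      # the remaining data
    return train, val, test


if __name__ == "__main__":
    # Toy labels: 0 = under the threshold, 1 = at or over the threshold.
    toy_labels = [0] * 600 + [1] * 400
    tr, va, te = stratified_split(toy_labels)
    print(len(tr), len(va), len(te))
```

Sampling within each category keeps the class proportions of the validation and test sets close to those of the full sample, which matters when one side of an age threshold is much larger than the other.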

Manual classification

All OPGs were assessed separately by two experts with extensive clinical experience and forensic training after sufficient discussion. To evaluate the dental age, the first step was to estimate the developmental stage of the eight left mandibular permanent teeth in each OPG based on the Demirjian method [24], and the second step was to transform stages A to H into numeric stages 1 to 8, respectively. We established a logistic regression linear model with a dichotomous variable Y, with 0 representing an individual under the age threshold and 1 representing an individual equal to or over the age threshold. We derived logistic regression linear models for each age threshold to measure the relationship between the dependent variable (Y) and eight independent variables (Sx, x = 1, 2, …, 8) or one independent variable (S8, the numeric stage of the third molar) in the training set. In the test set, we tested the classification performance of the logistic regression models in determining whether an individual was younger than the legal age threshold (Y = 0) or equal to or older than the legal age threshold (Y = 1).
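The two steps of the manual pipeline, stage encoding followed by a logistic model of Y, can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' SPSS procedure: the fitting routine (batch gradient descent), the centring of the stage at 4.5 for numerical conditioning, the 0.5 probability cutoff, and the toy data are all hypothetical.

```python
import math

# Demirjian stages A..H mapped to numeric stages 1..8, as described above.
STAGE_TO_NUM = {stage: i for i, stage in enumerate("ABCDEFGH", start=1)}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.2, epochs=3000):
    """Fit P(Y=1 | x) = sigmoid(w0 + sum_j w_j * x_j) by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict(w, x, cutoff=0.5):
    """Return 1 (equal to or over the threshold) or 0 (under the threshold)."""
    p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
    return int(p >= cutoff)


if __name__ == "__main__":
    # Toy third-molar stages and threshold labels (invented for illustration).
    stages = ["C", "D", "D", "E", "E", "F", "F", "G", "G", "H", "H", "H"]
    y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
    # Assumption: centre the numeric stage at 4.5 to help gradient descent.
    X = [[STAGE_TO_NUM[s] - 4.5] for s in stages]
    w = fit_logistic(X, y)
    print(predict(w, [STAGE_TO_NUM["H"] - 4.5]), predict(w, [STAGE_TO_NUM["C"] - 4.5]))
```

The same functions accept eight predictors per individual (S1 to S8) by passing eight-element rows in `X`, which corresponds to the 31–38 variant of the model.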

Machine classification

To verify whether a deep learning model can replace humans in automatically extracting feature information for age classification on OPGs, without interference from a manually defined staging system, we built an end-to-end classification network. We selected the EfficientNet network [25], which uses a compound model scaling algorithm to comprehensively optimize the network width, network depth, and resolution and thereby achieves a significant improvement in accuracy. To avoid accidental results and to check the robustness of the final result, we conducted another set of controlled experiments with SE-ResNet 101, constructed by embedding squeeze and excitation blocks (SE blocks) into the original ResNet 101 [26] network. In this network, the correlation between feature channels is modelled, important features are enhanced, and more nonlinear operations are added so that the model can better fit the complex correlation between channels. At the optimization level, the model uses the SGD optimization algorithm, which selects only one sample at a time to update the model parameters and promotes faster convergence of our fusion network model. To set a better learning rate, help the model converge, and avoid an “explosion” of the loss value of the objective function, we used an exponentially decaying learning rate strategy in which the learning rate was reduced exponentially every 30 rounds. This approach helps to stabilize the results of learning. We also applied data augmentation to the training data, such as random cropping and random horizontal flipping, which encourages the model to learn the feature information of the OPGs more fully. Figure 1 shows the whole framework of the legal age threshold classification.
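One common reading of a learning rate that is reduced exponentially every 30 rounds is a stepped exponential schedule, sketched below. The initial learning rate of 0.01 and the decay factor of 0.1 are assumptions chosen for illustration; the study does not report these values.

```python
def stepped_exponential_lr(epoch, base_lr=0.01, gamma=0.1, step=30):
    """Learning rate held constant within each block of `step` epochs,
    then multiplied by `gamma` at every block boundary.

    base_lr and gamma are hypothetical values, not taken from the source.
    """
    return base_lr * gamma ** (epoch // step)


if __name__ == "__main__":
    for epoch in (0, 29, 30, 59, 60, 90):
        print(epoch, stepped_exponential_lr(epoch))
```

In frameworks such as PyTorch, a comparable schedule is available as `torch.optim.lr_scheduler.StepLR`; whether the authors used that implementation is not stated.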

Fig. 1

The framework of legal age threshold classification. a Test dataset. b Training dataset. c Experts involved in assessing the medical image. d The end-to-end classification based on EfficientNet and SE-ResNet. C1–C5, convolution layers 1–5; SE, squeeze and excitation; FC, fully connected layer

Statistical analyses

All statistical analyses were performed using the statistical software IBM SPSS 18.0 (IBM® SPSS® Statistics, Armonk, NY). The chronological age of each subject was calculated by subtracting the date of birth from the date of the radiograph. Weighted Cohen’s kappa was used to evaluate the intra-rater and inter-rater agreement of the developmental stages of each tooth using the Demirjian method. For this purpose, two experienced dentists assessed 100 randomly selected OPGs and re-examined them after an interval of 2 weeks. The experts were blinded to the age of each subject when making their estimations.
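For readers who want to reproduce the agreement statistic outside SPSS, weighted Cohen's kappa for two sets of ordinal stage ratings can be sketched as below. The weighting scheme is an assumption: the text does not say whether linear or quadratic weights were used, so the sketch exposes both through a hypothetical `power` parameter.

```python
def weighted_kappa(r1, r2, n_cat=8, power=1):
    """Weighted Cohen's kappa for ordinal ratings coded 1..n_cat.

    power=1 gives linear disagreement weights, power=2 quadratic weights.
    Undefined (division by zero) if both raters use one single category.
    """
    n = len(r1)
    # Observed joint proportions of (rater 1 stage, rater 2 stage).
    obs = [[0.0] * n_cat for _ in range(n_cat)]
    for a, b in zip(r1, r2):
        obs[a - 1][b - 1] += 1.0 / n
    row = [sum(obs[i]) for i in range(n_cat)]
    col = [sum(obs[i][j] for i in range(n_cat)) for j in range(n_cat)]

    observed = expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            w = (abs(i - j) / (n_cat - 1)) ** power  # disagreement weight
            observed += w * obs[i][j]
            expected += w * row[i] * col[j]  # chance-expected disagreement
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy ratings of the same teeth by two raters (invented for illustration).
    rater_a = [1, 2, 3, 4, 5, 6, 7, 8, 8, 7]
    rater_b = [1, 2, 3, 5, 5, 6, 7, 8, 7, 7]
    print(round(weighted_kappa(rater_a, rater_b), 3))
```

Intra-rater agreement uses the same function with one expert's first and second readings as the two inputs.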

A two-by-two contingency table was used to evaluate the performance of each method to classify whether an individual was under the legal age threshold (14, 16, or 18 years) or not. The accuracy (the proportion of cases that are correctly classified), the sensitivity (the proportion of individuals who are truly under the age threshold who are classified as being under the threshold), and the specificity (the proportion of individuals who are truly equal to or over the age threshold who are classified as being equal to or over the age threshold) of the test were evaluated.
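The three measures follow directly from the cells of the two-by-two table. In the sketch below, labels use the same coding as the manual model (0 = under the threshold, 1 = equal to or over the threshold), so sensitivity is computed on the under-threshold class as defined above; the function name and the toy labels are illustrative.

```python
def threshold_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for age threshold classification.

    Coding: 0 = under the threshold, 1 = equal to or over the threshold.
    Sensitivity refers to the under-threshold class, specificity to the other.
    """
    pairs = list(zip(y_true, y_pred))
    under_correct = sum(1 for t, p in pairs if t == 0 and p == 0)
    under_missed = sum(1 for t, p in pairs if t == 0 and p == 1)
    over_correct = sum(1 for t, p in pairs if t == 1 and p == 1)
    over_missed = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "ACC": (under_correct + over_correct) / len(pairs),
        "Se": under_correct / (under_correct + under_missed),
        "Sp": over_correct / (over_correct + over_missed),
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration.
    truth = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    predicted = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
    print(threshold_metrics(truth, predicted))
```

Under this convention, a minor misclassified as an adult lowers the sensitivity, which is the error described in the Introduction as technically unacceptable.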

Results

Manual classification

For the reproducibility tests, the weighted Cohen kappa for the average intra-rater agreement was 0.921, whereas the average inter-rater agreement was 0.834.

Comparison of the legal age threshold classifications at 14, 16, and 18 years indicates that the accuracy of the manual method is stable. As shown in Table 2, the differences in accuracy (ACC) among the age thresholds of 14, 16, and 18 years are relatively small, mostly lower than 2%. Most sensitivity (Se) values are above 90%, except for the Se at the age threshold of 18 years. For the age thresholds of 14 and 16 years, both the ACC and the specificity (Sp) are higher in the 31–38 models (left mandibular eight teeth in the FDI numbering system), while the Se is higher in the 38 models (left mandibular third molar in the FDI numbering system). At the age threshold of 18 years, however, the ACC and Sp are similar in both regression models, whereas the Se differs, at 94.9% for the 31–38 model and 87.7% for the 38 model.

Table 2 The classification performance of the manual method and machine method based on 31–38 or 38

Despite its stable accuracy, the manual method is still unsatisfactory, which is attributed to the subjectivity and limited reproducibility of the technique as well as measurement bias. In addition, the traditional manual age classification method is labour-intensive and time-consuming.

Machine classification

To confirm whether the procedures and features determined by humans are suitable for age classification, we developed end-to-end CNN models without human interference. Automated age classification with the end-to-end CNN models took 0.003 s per image on average. As shown in Table 2, the performance of the end-to-end CNN models is better than that of the manual method: the ACC, Se, and Sp all increased, except for the Se at the age threshold of 14 years for the 31–38 model, the Se at 16 years for the 38 model, and the Sp at the age threshold of 18 years. The ACC at the age thresholds of 14 and 16 years exceeds 95%, and that at 18 years reaches 93.3%.

To explore the reason for the better classification performance of the machine method, we generated Grad-CAM heatmaps. By analysing the heatmap images of the test group (Fig. 2), we find that instead of focusing on the tooth morphology, which exhibits a high density in the X-ray images, the deep convolutional neural network (DCNN) pays more attention to the low-density regions of the images, such as the dental pulp cavity, periodontal membrane, the area between adjacent teeth, and the area between the deciduous teeth and permanent teeth. The results indicate that without the intervention of manual factors, machine learning can independently extract features in OPGs highly related to age and establish the complex comprehensive correlation between multiple features and chronological age.
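The core Grad-CAM computation can be sketched without any deep learning framework. In the standard formulation, each feature map of the last convolutional layer is weighted by the spatial mean of the gradient of the class score with respect to that map, the weighted maps are summed, and a ReLU keeps only the positive evidence. The sketch assumes the activations and gradients have already been extracted from a trained network as nested lists of shape [channels][height][width]; the function name and the final scaling to [0, 1] are illustrative choices, and the authors' actual implementation is not described.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from last-layer feature maps and their gradients.

    Both arguments are nested lists of shape [K][H][W] (K feature channels).
    """
    K = len(activations)
    H = len(activations[0])
    W = len(activations[0][0])

    # Channel weights: global average of the gradient over spatial positions.
    alphas = [sum(sum(row) for row in gradients[k]) / (H * W) for k in range(K)]

    # Weighted sum of the feature maps, followed by ReLU.
    cam = [
        [max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(K)))
         for j in range(W)]
        for i in range(H)
    ]

    # Scale to [0, 1] for display (skipped if the map is all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


if __name__ == "__main__":
    # Two toy 2x2 feature maps: one with positive and one with negative gradients.
    acts = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
    grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
    print(grad_cam(acts, grads))
```

In practice the coarse map is upsampled to the size of the OPG and overlaid on it, which is how the highlighted regions can be compared with anatomical structures.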

Fig. 2

The Grad-CAMs for different age thresholds based on 31–38 or 38

Similar results are obtained with the manual and machine methods (Table 2). Age classification at different age thresholds requires different amounts of dental development information. The ACC, Se, and Sp of the 31–38 models are mostly higher than those of the 38 models for age classification at 14 and 16 years, while age classification at 18 years shows the opposite result. For lower age thresholds, most individuals have several teeth still under development around the age threshold, so classification based on more teeth can obtain higher accuracy. For a high age threshold, however, most individuals have only third molars under development, so taking too many teeth into account may introduce interference that decreases the accuracy. In addition, the low sensitivity of the 38 models in both methods for the age threshold of 14 years may be caused by the slow development of third molars when children are approximately 14 years old. Since the third molar develops slowly, the images of the third molar in the OPGs are similar, which makes it difficult to distinguish whether an individual is under or over 14 years old.

Discussion

By comparing legal age threshold (14, 16, and 18 years old) classification based on the traditional manual method and the deep learning model, we find that the deep learning model established for age threshold classification can replace or even surpass the manual method. Although the performance of the manual method is stable, the end-to-end deep learning technique without human interference can effectively overcome the limitations of the manual method and greatly improve the classification performance. Moreover, using the Grad-CAM method, we find that instead of using dental morphological traits that are easy for a human to identify, the machine method focuses on the low-density features around the teeth, which may be the reason for the better performance of the end-to-end deep learning model.

The stability of the manual method may be attributable to the ability of humans, drawing on their professional knowledge and previous experience, to distinguish normal images from noisy images and thereby avoid the feature interaction problem and noise interference. There are various features in the OPGs correlated with age, but some of them do not change regularly with age and are not easy for humans to identify. As shown in Fig. 3, the distribution of dental development stages differs between age groups, and the older the subject is, the more teeth have finished development (stage 8). This indicates that the manual method, which focuses only on dental development, relies on information highly related to age, which makes its classification performance stable. However, there are still several limitations in manual age classification. First, traditional manual methods are likely to incorporate a relatively high degree of intra- and inter-observer error due to the subjectivity of stage evaluation, which can lead to an increase in the prediction error [27]. Second, the relationship between dental development and age is a nonlinear function that is hard to specify, so the logistic regression linear model can only provide a local approximation. Third, the development of teeth is a continuous process, whereas Demirjian et al. [24] divided tooth development into eight discrete stages, which cannot perfectly fit the complex relationship between dental development and age. Moreover, manual methods are labour-intensive and time-consuming, which makes them inconvenient to apply in routine clinical activity [28, 29].

Fig. 3

The verification of the Demirjian method for age classification. a Estimating the dental development stages of eight left mandibular permanent teeth for age classification in OPGs. b The distribution of different dental stages of all eight teeth for 5–24 years old groups

In recent years, deep learning techniques have made breakthroughs in the fields of computer vision [30, 31], speech recognition, natural language processing, and bioinformatics [32, 33], and have achieved good results in many fields [34, 35]. Previous studies on age classification mainly focused on the performance of either traditional manual methods or deep learning models. This is the first study to verify, based on a large sample, whether a deep learning model can replace humans in dental age classification. We used an end-to-end deep learning technique to determine whether it can surpass the performance of the manual method for age classification. Our experiments showed that the CNN classification methods obtained the best prediction results in age threshold classification. Without the intervention of manual factors, machine learning can analyse the correlation between age and the features of OPGs and establish the complex comprehensive correlation between multiple features and age. In addition, this learning model is data-driven, which means that when the dataset is large enough, high-precision identification can be realized.

The Grad-CAM method provides a heatmap that visualizes the regions of interest of a deep learning model, i.e. the regions that contribute the most to increasing or decreasing the output. In the medical field, some studies [36,37,38,39] used the Grad-CAM method to identify the region of interest and validate the DCNN performance. In our study, we found that the DCNN pays more attention to the low-density regions of the images, including the dental pulp cavity, periodontal membrane, the area between adjacent teeth, and the area between the deciduous teeth and permanent teeth. Since the dental pulp cavity experiences age-related pathological and physiological changes [40], Cameriere et al. [41] and Kvaal et al. [42] both established age estimation methods based on the age-related changes in the dental pulp cavity, which have been verified in some studies [18, 43, 44]. For the periodontal membrane, several methods have been proposed by Olze et al. [12, 45, 46] to classify an individual as below or above the 18-year threshold, especially once root growth is completed. In our previous study [47], we also estimated the dental age based on the visibility of the periodontal ligament of the third molar in the northern Chinese population. Both the area between the adjacent teeth and the area between the deciduous teeth and permanent teeth change with age during the replacement of the deciduous teeth by the permanent teeth, and can therefore serve as supplementary features for dental age estimation. Human beings can clearly identify only the high-density features in X-ray images, while a machine can assess every feature in the images, select highly related regions for classification, combine these features, and fit the complicated relation between them, which helps to increase the accuracy of the age classification.

Although our study established a satisfactory DCNN model for age classification that performed better than the traditional manual method, there are still two limitations. Firstly, this study focuses only on age classification, which is the foundation of age estimation, and whether the DCNN model is suitable for age estimation has not been confirmed. Secondly, dental age estimation methods perform differently among various ethnic populations, but we established the DCNN model and compared it with the manual method only in a Chinese population. It remains uncertain whether similar results can be obtained in other populations.

In conclusion, comparing the age classification performance of the traditional manual method and the machine method, the end-to-end deep learning without human interference can surpass the performance of the manual method, and the features extracted in the CNN model are different from those defined by a human. This approach holds great promise for future forensic practice.