Background

Root canal treatment involves cleaning, shaping, and obturation of the root canal system to prevent or treat apical periodontitis [1]. Despite relatively high success rates (82–92%) [2], endodontic treatment still carries risks of failure that can be influenced by procedural errors and mishaps [3]. Studies have demonstrated that errors including apical perforation, failure to achieve patency due to ledges or blockages, and improper obturation length can significantly reduce success rates [4,5,6]. This is concerning, as such errors may lead to post-operative complications and tooth loss.

Anatomical complexities and aberrations in tooth crown and root canal morphology are key factors that may increase the risk of procedural errors when treating difficult cases [7]. Appropriate pre-operative case assessment, and referral of complex cases by general dentists to endodontic specialists, is therefore critically important for improving outcomes [8]. To aid standardized assessment of case difficulty, guidelines such as the American Association of Endodontists’ (AAE) endodontic case difficulty assessment categorize complexity based on multiple criteria visible on radiographs [9]. These include tooth type, arch position, rotation, extent of crown destruction, root morphology, apex diameter, and canal visibility. However, manual application of these guidelines is time-consuming and prone to subjectivity and inter-rater variability [10, 11]. More objective, automated assessment tools are needed to help general dental practitioners reliably gauge case difficulty from standard radiographs early on and identify cases warranting specialty referral.

Recent advances in artificial intelligence (AI), specifically deep learning, show strong promise for automating such complex diagnostic and treatment planning tasks in healthcare [12]. Deep learning utilizes multi-layered neural networks capable of automatically identifying intricate patterns and relationships in data without the need for explicit human programming. Within medicine, deep learning has already demonstrated expert-level performance analyzing medical images for various tasks. For example, deep learning models have shown accuracies rivaling healthcare specialists in diagnosing diabetic retinopathy from retinal fundus images [13] and predicting skin cancer from clinical images [14]. Within the dental field, preliminary research has applied deep learning models for tasks including tooth numbering [15, 16], caries diagnosis [17], detection of periapical lesions [18], and extraction difficulty of third molars [19, 20]. However, deep learning has traditionally relied heavily on supervised learning techniques, which require massive manually annotated datasets that are expensive, time-consuming, and prone to human subjectivity, inconsistency, and errors [21]. Transfer learning can mitigate this by initializing models with general image features learned on large datasets, before fine-tuning on more limited medical data. An alternative approach is self-supervised learning (SSL), which can pre-train neural networks on abundant unlabeled medical imaging data [22]. SSL models learn meaningful feature representations from the data itself without the need for manual labeling, which is especially valuable for specialized fields like dentistry, where annotated data is scarce. Moreover, studies have shown SSL models can surpass supervised models in disease detection from retinal images [23].

For endodontic case difficulty assessment, which involves evaluating multiple anatomical factors, an SSL approach seems promising. However, we found no previous studies investigating deep learning techniques for standardized pre-operative assessment of non-surgical endodontic treatment difficulty. In this study, we aimed to develop and validate a diagnostic tool that uses deep learning on periapical radiographs to determine endodontic case difficulty based on established guidelines.

Materials and methods

Study design

This retrospective study used deep convolutional neural network models to assess endodontic case difficulty from periapical radiographs based on AAE guidelines. We examined five state-of-the-art, pre-trained convolutional neural network (CNN) architectures: ResNet50, ResNet18, Inception V2, VGG16 (with batch normalization layers), and ResNeXt50. We also explored four self-supervised learning (SSL) approaches: SimCLR, MoCo, BYOL, and DINO. All models were used to categorize endodontic case difficulty in a binary classification analysis, and the top-performing model was additionally used to predict overall difficulty scores through regression. The Ethics Committee of Hamadan University of Medical Sciences approved this study (IR.UMSHA.REC.1402.026). The study results are reported in accordance with the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) [24].

Dataset and preparation

A dataset of 1,386 periapical radiographs of adult patients was compiled, including images from the radiology department of Hamadan University of Medical Sciences and a private dental clinic in Hamadan, Iran. Radiographs were captured using a MINRay (Soredex, Tuusula, Finland) radiography unit and Optime (Soredex, Tuusula, Finland) size #2 phosphor plate sensors. Exposure settings were standardized at a tube voltage of 60 kV and a tube current of 7 mA, with exposure times of 0.16 to 0.32 s adjusted by tooth type. Inclusion criteria were permanent teeth with fully visible crowns and roots without obscuring artifacts. Exclusion criteria were deciduous teeth, impacted teeth, the presence of orthodontic appliances, and poor image quality due to processing errors, patient motion, or other artifacts.

Ground truth annotations

All images were de-identified using randomized numeric labels. The periapical images were selected by the principal researcher (S.S.), and two dentists (N.G. and A.M.) labeled the dataset for endodontic case difficulty based on the latest AAE guidelines. Difficulty ratings of low (1 point), moderate (2 points), and high (5 points) were assigned for each of the following criteria: tooth type, inclination, rotation, crown anatomy, root morphology, apex diameter, and canal visibility, mirroring the scoring system used in the “simple assessment” of the AAE EndoCase mobile application.

The AAE guidelines involve both subjective assessments and objective measurements. Objective measurements were performed with Digimizer software v5.4.9 (MedCalc Software, Mariakerke, Belgium) for the following features: tooth length; inclination, measured as the tooth angle deviation in the mesiodistal dimension; and canal curvature, measured as the angle of curvature using Schneider’s method. Apex diameter was categorized based on morphology: blunderbuss apices were considered open (> 1.5 mm), while parallel-walled open apices were categorized as “between 1 and 1.5 mm”. For crown destruction, cuspal-coverage restorations and missing cusps were considered extensive. Reference images with measurements were used to improve standardization of tooth length and apex diameter ratings.

To establish standardized criteria, an initial set of 100 periapical radiographs was annotated by the two dentists. Their difficulty ratings were reviewed by a third senior endodontist, and any disagreements or uncertainties were discussed to reach a consensus. This process refined the assessment criteria and improved inter-rater reliability. The two dentists then independently labeled the remaining radiographs using the finalized criteria and rubric. One researcher (S.S.) evaluated the differences in ratings between the two raters; in cases of disagreement, two board-certified endodontists with at least 10 years of experience (H.K. and E.K.) were consulted to provide the decisive rating. AAE low and moderate difficulty were combined into an “easy” category, since low difficulty corresponds to only a few anterior and premolar cases. Cases with combined scores ≤ 10 were categorized as “easy”, while scores ≥ 11 were labeled “hard” (high difficulty) for binary classification, as illustrated in the sketch below.
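To make the labeling rubric concrete, the following minimal Python sketch computes the combined score and binary label from per-criterion ratings; the function and variable names are hypothetical, not part of the study’s code.

```python
# Hypothetical sketch of the AAE-style scoring used for labeling:
# each of the seven criteria is rated low (1), moderate (2), or high (5),
# and the summed score is binarized at the 10/11 cutoff described above.
CRITERIA = ["tooth_type", "inclination", "rotation", "crown_anatomy",
            "root_morphology", "apex_diameter", "canal_visibility"]
POINTS = {"low": 1, "moderate": 2, "high": 5}

def difficulty_label(ratings):
    """Return (combined score, binary label) for one tooth."""
    score = sum(POINTS[ratings[c]] for c in CRITERIA)
    return score, "easy" if score <= 10 else "hard"

# Example: 5 + 1 + 1 + 2 + 5 + 1 + 5 = 20 points -> "hard"
score, label = difficulty_label({
    "tooth_type": "high", "inclination": "low", "rotation": "low",
    "crown_anatomy": "moderate", "root_morphology": "high",
    "apex_diameter": "low", "canal_visibility": "high"})
```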

Image preprocessing

Individual teeth were cropped from the periapical radiographs by a trained dentist (S.S.). Images were cropped with at least a 10-pixel margin and converted to JPEG format, then resized to 224 × 224 pixels for all models except Inception v2, which required 299 × 299 input.
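A minimal sketch of this preprocessing with Pillow is shown below; the bounding box comes from the manual cropping step, and all file names and coordinates are illustrative.

```python
from PIL import Image

MARGIN = 10  # at least a 10-pixel margin around each tooth

def crop_and_resize(path, box, size=224):
    """Crop one tooth with a margin and resize it for the CNN input."""
    img = Image.open(path).convert("RGB")
    left, top, right, bottom = box
    img = img.crop((max(left - MARGIN, 0), max(top - MARGIN, 0),
                    min(right + MARGIN, img.width),
                    min(bottom + MARGIN, img.height)))
    return img.resize((size, size))  # use size=299 for Inception v2

tooth = crop_and_resize("radiograph_0001.jpg", box=(120, 80, 380, 540))
tooth.save("tooth_0001.jpg", format="JPEG")
```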

Model architecture

Baseline models

Transfer learning was used to improve model training efficiency. All models were initialized with weights pre-trained on ImageNet. By transferring knowledge from this large general image dataset, the network could focus on fine-tuning dental radiograph features rather than learning from scratch, enabling faster convergence with less data than a randomly initialized model. When fine-tuning the ImageNet pre-trained models, all layers were frozen except for the batch normalization layers. This allowed the distribution of layer inputs to be adjusted to the dental radiograph data while retaining the learned feature representations from ImageNet in the convolutional layers. Additional fully connected layers were constructed on top of the base models to generate predictions for the endodontic case difficulty tasks.
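A minimal PyTorch sketch of this setup, using VGG16 with batch normalization as an example, is shown below. The exact classification head used in the study is not specified, so the single replacement layer here is an assumption.

```python
import torch.nn as nn
from torchvision import models

model = models.vgg16_bn(pretrained=True)  # ImageNet-initialized weights

# Freeze every parameter, then re-enable gradients for the batch
# normalization layers only, as described above.
for param in model.parameters():
    param.requires_grad = False
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        for param in module.parameters():
            param.requires_grad = True

# Replace the ImageNet classifier output with a two-class head
# (easy vs. hard); this head architecture is an assumption.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)
```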

SSL models

Self-supervised pretraining was performed using a contrastive learning technique [25]. The key idea is to train models to differentiate between augmented views of the same image (positive pairs) and views from different images (negative pairs). This forces the model to learn generalized visual representations from the medical image data alone, without any labeled categories. Typically, two separate augmented crops are created from each unlabeled dental radiograph; an encoder processes one augmented view while a separate encoder processes the other. If the crops originate from the same image, the model pulls the encoded representations closer together; if they come from different images, it pushes the representations apart. By optimizing this contrastive signal across many image pairs, the model learns robust features unconfounded by any downstream task labels. For pretraining, we leveraged a diverse dataset of 20,295 unlabeled panoramic, bitewing, and periapical radiographs from a private clinic. After unsupervised pretraining, the encoder was transferred to initialize our classification model. The pretrained features were frozen, and a classifier was trained on top using the smaller labeled dataset to categorize case difficulty.
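The contrastive objective can be illustrated with a SimCLR-style NT-Xent loss, sketched below; the four SSL methods used in this study (SimCLR, MoCo, BYOL, DINO) each implement the idea differently, so this is a conceptual example rather than the study’s implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style loss: pull two views of the same image together,
    push views of different images in the batch apart."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, D)
    sim = z @ z.t() / temperature                       # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                   # exclude self-pairs
    n = z1.size(0)
    # The positive for row i is its other augmented view, offset by N.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Usage: z1 and z2 are encoder embeddings of two augmentations of a batch.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent_loss(z1, z2)
```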

Training details

Models were implemented in Python using PyTorch 1.7.1 on the Google Colaboratory platform. Training was run on an NVIDIA T4 Tensor Core graphics processing unit with 12 GB of GDDR5 VRAM, paired with an Intel Xeon processor with two 2.2 GHz cores and 13 GB of RAM. Key hyperparameters, selected via a randomized search strategy, were a learning rate of 0.001, a batch size of 4 (baseline models) or 8 (SSL models), and the Adam optimizer. The loss function was categorical cross-entropy for the classification task and mean squared error for the regression task. Due to dataset imbalance, a class-weighted loss function was applied, and early stopping was used to prevent overfitting. Data were augmented using random horizontal flips, random rotation, color jitter, random affine transforms, and the TrivialAugment method [26]. TrivialAugment is an automatic augmentation method that takes an image x and a set of augmentations A as input; it simply samples an augmentation from A uniformly at random, together with a random strength m from the range [0, 30], applies the sampled augmentation to x with strength m, and returns the augmented image.
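A hedged sketch of this configuration is shown below; the augmentation parameters and class weights are placeholders, since the exact values are not reported, and torchvision’s TrivialAugmentWide requires a newer release than the 1.7.1-era stack described above.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Augmentation pipeline; all parameter values are illustrative assumptions.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),
    transforms.TrivialAugmentWide(),  # available in torchvision >= 0.12
    transforms.ToTensor(),
])

model = models.vgg16_bn(pretrained=True)

# Class-weighted cross-entropy to counter the easy/hard imbalance;
# the weight values here are placeholders, not the study's actual values.
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```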

Data partitions

Initially, 202 cropped tooth images were set aside as a held-out test set using random stratified sampling. The models were trained using 10-fold stratified cross-validation to prevent overfitting and assess generalizability. The dataset was randomly split into 10 equal folds, each containing a similar distribution of easy and hard cases. For each fold, the model was trained on the other 9 folds and validated on the held-out fold; this was repeated until every fold had served as the validation set once. All model hyperparameters were tuned using the 9-fold training sets only. The cross-validation results were then aggregated to evaluate model performance, and generalizability was assessed as the average accuracy across the 10 validation folds, as sketched below.
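A minimal version of this partitioning with scikit-learn’s StratifiedKFold might look as follows; `labels`, `indices`, and the training step are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array([0, 1] * 100)    # placeholder easy (0) / hard (1) labels
indices = np.arange(len(labels))   # placeholder image indices

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_accuracies = []
for train_idx, val_idx in skf.split(indices, labels):
    # Each fold preserves the easy/hard ratio: train on 9 folds,
    # validate on the held-out fold.
    acc = 0.0  # placeholder: train on train_idx, evaluate on val_idx
    fold_accuracies.append(acc)

mean_accuracy = float(np.mean(fold_accuracies))  # generalizability estimate
```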

Clinician assessment

Periapical images from the test set were provided to three general dentists and four endodontists for independent assessment. The examiners, comprising general dentists with an average of 2.6 years and endodontists with an average of 8 years of clinical experience, were given a brief orientation on case difficulty assessment using the simple assessment in the EndoCase application. No formal calibration was performed, in order to evaluate independent analysis. The evaluations were conducted under consistent, controlled conditions, with images displayed on standard 1080p monitors in a darkened room to minimize external distractions and simulate ideal clinical viewing settings. Working independently, the evaluators categorized each tooth as ‘hard’ or ‘easy’ based on their clinical experience and judgment, guided qualitatively by the criteria outlined in the assessment tools.

Evaluation

We evaluated model and clinician performance on a held-out test set. The accuracy, precision, recall, and F1-score of each model and clinician on the test set are presented for each class and dataset.

$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{F1 score} = \frac{2TP}{2TP + FP + FN}$$

where TP, TN, FP, and FN are the numbers of true-positive, true-negative, false-positive, and false-negative samples, respectively. Confusion matrices were generated, and receiver operating characteristic (ROC) curves were plotted with area under the curve (AUC) metrics for each model. Interobserver agreement was assessed using Fleiss’ kappa, interpreted as follows: ≤ 0.20, slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; 0.81–1.00, almost perfect agreement. Statistical analyses were performed using SPSS for Windows version 15 (SPSS Inc., Chicago, IL, USA).
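As a hedged illustration, Fleiss’ kappa for the binary labels can be computed with statsmodels; the `ratings` array below is toy data standing in for the (n_images × n_raters) label matrix.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy matrix: rows are images, columns are raters, 0 = easy, 1 = hard.
ratings = np.array([[0, 0, 1],
                    [1, 1, 1],
                    [0, 1, 0],
                    [1, 1, 0]])

table, _ = aggregate_raters(ratings)          # per-image category counts
kappa = fleiss_kappa(table, method="fleiss")  # interobserver agreement
```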

Results

Study populations

The distribution of case difficulty items is presented in Table 1. Our dataset comprised 603 molar teeth and 783 anterior/premolar teeth.

Table 1 The distribution of case difficulty items

Classification task

The 10-fold cross-validation accuracies and the test-set accuracy, precision, recall, and F1-scores are presented in Table 2, and ROC curves of the models are shown in Fig. 1. The Inception v2 and DINO models had the best cross-validation accuracies, at 91.05% and 91.04%, respectively. The VGG16 and Inception v2 models had the best AUC scores, at 94.36% and 92.42%, respectively. The VGG16 model had the best overall precision, recall, and accuracy across all models.

Table 2 The results from cross validation accuracy and precision, recall, and F1-score of all models in the test set
Fig. 1

Receiver Operating Characteristic (ROC) curves illustrating the prediction performance of models on the test set, with each model represented by a distinct color

Error samples of the VGG16 model are illustrated in Fig. 2. Error analysis showed that false predictions in the “Easy” category predominantly occurred in teeth with reduced canal visibility, while false predictions in the “Hard” category were mainly associated with anterior teeth with open apices and long teeth.

Fig. 2

Error samples in VGG16 predictions: a reduced canal; b open apex; c long tooth

Regression task

Given its superior classification performance, the VGG16 model was selected for the regression task. The average mean squared error was 10.32, meaning the model can predict the difficulty score within a margin of approximately ± 3.21 points.
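This margin corresponds to the root mean squared error implied by the reported MSE:

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{10.32} \approx 3.21$$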

Clinician assessment

The evaluation metrics of human performance are provided in Table 3. All human scores were lower than those of the deep learning models. The Fleiss kappa interobserver agreement among all observers, among general dentists, and among endodontists was 0.22, 0.54, and 0.39, respectively: general dentists showed moderate agreement, whereas agreement among endodontists and across all observers was only fair.

Table 3 Human performance on the test set

Discussion

Assessment of case complexity remains a key challenge in endodontic diagnosis and treatment planning, requiring comprehensive evaluation through clinical examination, radiographic analysis, and understanding of the operator’s skills [1]. Attempting procedures beyond one’s capabilities risks intraoperative errors and subsequent harm to patient health, while also exposing clinicians to potential legal repercussions [8]. Thus, guidelines have been developed to determine case difficulty for procedures like molar extraction and root canal therapy, aiding clinical decision-making. Our present study demonstrates for the first time that deep learning models can predict endodontic treatment difficulty from periapical radiographs with high accuracy.

Deep learning has garnered considerable interest in medicine owing to its high learning capacity and demonstrated ability to automate intricate diagnostic and treatment planning tasks. In endodontics, deep learning has shown promise in detecting vertical root fractures [3] and special tooth anatomies such as C-shaped canals [27] and taurodontism [28]. Such models may also assist in gauging treatment complexity to inform planning and referral decisions; for instance, CNNs have been applied to categorize third molar surgery difficulty [29]. We examined several state-of-the-art convolutional neural network architectures for this classification task. Among the supervised models, Inception v2 achieved the highest cross-validation accuracy at 91.05%, while VGG16 demonstrated the best overall test performance, attaining 87.62% accuracy and a 94.36% AUC with the top precision, recall, and F1-scores. The self-supervised DINO model narrowly exceeded VGG16 in cross-validation accuracy, while most other SSL techniques failed to match the top supervised models. Thus, although self-supervised pretraining shows promise for medical imaging tasks, it did not improve overall accuracy in this study compared with supervised training alone. Error analysis revealed that the majority of model errors occurred in cases with reduced canal visibility, long tooth lengths, and open apices. These features were severely underrepresented in the dataset, suggesting that SSL may still prove advantageous given sufficient sample diversity. Nevertheless, all deep learning models showed higher precision and recall than the human raters. The low inter-rater agreement highlights consistency issues with the AAE guideline; automated AI assessment could address these reliability limitations while improving accuracy.

For root canal therapy, the AAE case difficulty assessment is widely utilized and has proved useful in predicting endodontic mishaps [30], obturation length [1], and 4-year clinical success rates [31]. It categorizes cases as “low difficulty,” “moderate difficulty,” or “high difficulty” based on both patient-related and tooth-related factors. However, we assessed only simple tooth-related factors that can be evaluated on periapical radiographs, using the “simple assessment” in the AAE’s EndoCase application. We combined low and moderate difficulty into the “easy” category and high difficulty into the “hard” category for our classification analysis, since the low-difficulty category comprises only a very few cases of anterior and premolar teeth. To our knowledge, the present study is the first to incorporate deep learning into endodontic case difficulty assessment. By developing an objective, computational method, we aimed to address longstanding issues with subjectivity in conventional human-based assessments. Our model could help standardize case selection for dental students and junior clinicians by supplementing evaluation with a data-driven approach.

In addition to classification, we applied the model to regression, predicting individual difficulty scores using the scoring system of the AAE’s EndoCase application. This scoring system, developed by endodontic specialists, categorizes complexity on a scale starting at 7 points. VGG16 predicted scores to within ± 3.21 points. Predicting the overall difficulty level rather than a binary label may offer useful clinical insights: higher scores could flag challenging cases requiring more appointment time or specialist referral, reducing clinician fatigue and improving outcomes. With further validation, AI-predicted difficulty scoring could assist in balanced pre-operative case scheduling. Further studies are needed to assess the clinical significance of difficulty scores. Moreover, there is a need to develop AI-optimized rating guidelines better suited to algorithmic analysis; retooling for AI compatibility could improve assessment standardization.

This study had some limitations. By performing binary classification rather than predicting specific difficulty factors, some diagnostic detail was lost. Additionally, periapical radiographs provide limited two-dimensional information compared with 3D modalities such as cone-beam CT, and assessing complex anatomical factors of three-dimensional teeth on 2D images is challenging. For example, accurate characterization of canal curvatures, divisions, and root morphologies such as radix entomolaris/paramolaris is difficult without clear 3D visualization. Furthermore, clinical examination remains imperative for comprehensive assessment of patient-specific factors as well as tooth inclination and rotation. Moving forward, integrating three-dimensional imaging and patient record details could enhance modeling capabilities. Future work should also aim to elucidate the individual tooth characteristics driving treatment complexity.

Conclusion

This pilot investigation highlights the promise of deep learning to automate endodontic difficulty assessment as a clinical decision support tool. With further refinements to models and data sources, such an approach could help standardize preoperative evaluation.