Introduction

Background

According to conservative estimates, half a billion dental diagnostic radiological examinations are performed annually, with a global average of 74 per 1000 inhabitants, representing 13% of all diagnostic radiology examinations [1]. Three hundred million intraoral radiographs are produced annually in the US and the EU [2, 3]. An increasing share is acquired digitally because of clinical advantages, radiation protection considerations and the removal of financial barriers. Because they require no physical storage, digital radiographs tend to be stored indefinitely, accumulating in vast archives, especially in large institutions.

Digital X-ray images are accompanied by standardized metadata, ideally generated at the time of acquisition. In practice, however, manual recording is often necessary, leading to inaccuracies attributed to lack of staff motivation or training [4]. Deficiencies in healthcare datasets are a well-documented problem associated with repetitive labor and high repair costs.

Automated algorithms could take over the generation of standardized, high-quality metadata, freeing staff for more creative tasks and, if deployed at scale, raising the value of metadata above its maintenance cost.

An intraoral radiograph’s anatomical region is a metadata element drawn from a set of predetermined anatomical classes corresponding to standard radiographic projections; it is currently recorded manually by the operator. Reliable recording is vital for diagnosis and for realizing the benefits of DICOM, and it is a prerequisite for implementing hanging protocols and creating standardized radiographic layouts [5, 6]. It can also be valuable for machine learning model development [7] and for the searchability of radiographic archives. Automated classification exclusively from pixel data would enable its generation at the modality or database level and eliminate the need for manual preselection.

Since the extent of the depicted anatomy is known in advance, the task can be expressed as image classification among predefined classes, a fundamental problem in the field of computer vision. Recent advancements, especially after the introduction of the still-relevant AlexNet architecture in the ILSVRC competition [8,9,10], have led to the field being dominated by algorithms in which neural networks (especially convolutional ones) achieve relatively good performance in classifying natural images into different classes [11].

The subsequent release of libre software that abstracts the underlying development processes has enabled the rapid development of relevant applications by independent researchers. By 2017, over 300 applications related to radiological imaging had been published [12], while in the US, major radiology organizations recently issued a roadmap for future research on machine learning in relation to radiology [7].

All the benefits discussed above could potentially be realized by employing convolutional neural network architectures to automatically classify the anatomical region of intraoral radiographs.

In addition, exposing radiology staff to this recently introduced field through applications designed to eliminate trivial tasks can familiarize them with its concepts and shortcomings, fostering wider acceptance without the potential implications of diagnostic applications.

The purpose of this study is to systematically develop literature-based neural network architectures (models) capable of classifying the anatomical region of intraoral radiographs exclusively from pixel data, and to identify the most appropriate architectures and training strategies by cross-comparing model performance on a predefined dataset. The use of libre software, a simple model development methodology, a small dataset and limited computational and financial resources facilitates wide adoption and reproducibility. To our knowledge, no similar studies currently exist in the literature.

Materials and methods

The STARD 2015 [13] and CLAIM [14] checklists for the standardization and enhancement of the quality of diagnostic accuracy and artificial intelligence studies were followed where applicable.

Dataset generation

This retrospective study utilized archived intraoral radiographs obtained for diagnostic purposes. No subjects were exposed to X-rays for the purposes of this study. All subjects provided written consent.

Adult patients of any age and gender were included in chronological order, without further inclusion or exclusion criteria; non-adult patients were excluded. Included images were digital periapical or bite-wing radiographs acquired with the same modality (PSP plates, SOREDEX Scanora scanner).

The original uncompressed image data were fully anonymized and randomly shuffled by a hash-based algorithm. The dataset was evaluated for content relevance, technical and visual quality and proper classification by one evaluator with 8 years of experience, excluding images of inadequate content or poor quality.
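A hypothetical sketch of such a hash-based anonymization and shuffling step is shown below; the file format, naming scheme and salt are illustrative and do not reproduce the study's actual pipeline.

```python
import hashlib
import shutil
from pathlib import Path

SALT = b"project-specific-secret"  # assumed; prevents re-identification by rehashing known names

def anonymize_and_shuffle(src_dir, dst_dir):
    """Copy images under salted-hash filenames; sorting by the hash then acts
    as a shuffle uncorrelated with acquisition order or patient identity."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.png")):  # file format assumed
        digest = hashlib.sha256(SALT + path.name.encode("utf-8")).hexdigest()
        shutil.copy(str(path), str(dst / (digest + path.suffix)))
```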

Twenty-two unique anatomical classes were identified. Class definitions assumed that each projection clearly depicts a region of three consecutive teeth, except for the anterior classes 12–22 and 32–42, which included four. Classes LBW1, LBW2 and RBW1, RBW2 were merged due to data scarcity.

The resulting dataset was then randomly split into an 80% training subset and a 20% validation subset for model training. Proper dataset and split sizes were determined by a pilot study.
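A minimal sketch of the random 80/20 split, assuming scikit-learn's train_test_split; the seed value and file/label lists are placeholders.

```python
from sklearn.model_selection import train_test_split

SEED = 42  # assumed; a common seed value is used throughout (see "Model definition")

# image_paths and labels are placeholders for the curated dataset
image_paths = ["img_0001.png", "img_0002.png", "img_0003.png",
               "img_0004.png", "img_0005.png"]
labels = ["RBW", "LBW", "RBW", "LBW", "RBW"]

train_paths, val_paths, train_labels, val_labels = train_test_split(
    image_paths, labels, test_size=0.20, random_state=SEED, shuffle=True)
```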

Model definition

Model definitions and training methodology were based on the dominant initial choices derived from an extensive review of the literature [11] and are described in Table 1.

Table 1 Model definitions. MLP: multilayer perceptron, GAP: Global Average Pooling

All models were developed using the Keras API v2.2.4 with the Anaconda distribution of the Python programming language v3.6.7 and TensorFlow v1.14 as the backend. Convolutional architectures were added as feature extractors in front of a multilayer perceptron (MLP) classifier, and all models were trained with fully supervised learning using backpropagation of error.

Initially, a fully connected two-layer multilayer perceptron (1024 and 512 units wide) with flattened input was trained as a baseline classifier. A simple convolutional network of three convolutional blocks, as described in TensorFlow’s documentation (convolutional layers 64, 128 and 128 filters wide, with a 3 × 3 filter size and a max pooling layer each), was then added as a baseline feature extractor.
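A minimal Keras sketch of these two baselines is given below, assuming 224 × 224 single-channel input and 22 output classes; pooling sizes and other details not stated in the text are assumptions.

```python
from keras import layers, models

NUM_CLASSES = 22
INPUT_SHAPE = (224, 224, 1)  # grayscale input (see below)

def build_baseline_mlp():
    """Baseline classifier: flattened pixels into a 1024- and a 512-unit layer."""
    return models.Sequential([
        layers.Flatten(input_shape=INPUT_SHAPE),
        layers.Dense(1024, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_baseline_cnn():
    """Baseline feature extractor: three convolutional blocks (64, 128, 128
    filters, 3 x 3 kernels, max pooling) in front of the same MLP head."""
    return models.Sequential([
        layers.Conv2D(64, (3, 3), activation="relu", input_shape=INPUT_SHAPE),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
```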

A group of more advanced and innovative convolutional networks was then added consecutively: the deep but simple VGG16 architecture [15], MobileNetV2 as a balance between model size and performance [16], and Inception-ResNetV2 as a high-capacity, high-performing architecture considered especially suitable for small datasets [17, 18].
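The sketch below illustrates how such an architecture could be attached as a feature extractor under the same MLP head, using the corresponding keras.applications implementations; random initialization (weights=None) is an assumption, since transfer learning is listed only as future work.

```python
from keras import layers, models
from keras.applications.vgg16 import VGG16
from keras.applications.mobilenet_v2 import MobileNetV2
from keras.applications.inception_resnet_v2 import InceptionResNetV2

NUM_CLASSES = 22

def build_advanced_model(arch="MobileNetV2", input_shape=(224, 224, 3)):
    """Attach a keras.applications architecture as the feature extractor and
    place the two-layer MLP head on top. weights=None (random initialization)
    is an assumption; pretrained weights are left for future work."""
    bases = {
        "VGG16": VGG16,
        "MobileNetV2": MobileNetV2,
        "InceptionResNetV2": InceptionResNetV2,
    }
    base = bases[arch](include_top=False, weights=None, input_shape=input_shape)
    x = layers.Flatten()(base.output)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(512, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs=base.input, outputs=outputs)
```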

Additional models were generated by inserting batch normalization layers [19] and by using either a flattening or a global average pooling layer [20] as the bridge between the feature extractor and the MLP classifier.
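A sketch of the two bridging options and the optional batch normalization is shown below; the exact placement of the BatchNormalization layers is an assumption.

```python
from keras import layers

def attach_classifier(feature_maps, bridge="gap", batch_norm=True, num_classes=22):
    """Bridge the convolutional feature maps to the MLP head with either a
    Flatten or a GlobalAveragePooling2D layer, optionally inserting batch
    normalization before each activation (placement assumed)."""
    if bridge == "gap":
        x = layers.GlobalAveragePooling2D()(feature_maps)
    else:
        x = layers.Flatten()(feature_maps)
    for width in (1024, 512):
        x = layers.Dense(width)(x)
        if batch_norm:
            x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return layers.Dense(num_classes, activation="softmax")(x)
```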

Dropout layers with rates of 0.20–0.50 [21] and three different data augmentation strategies applied on the fly were used for regularization (described in Table 2).

Table 2 Data augmentation strategies with output examples
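The sketch below illustrates how three such on-the-fly strategies could be defined with Keras' ImageDataGenerator; the transformation ranges are placeholders standing in for the values of Table 2, and the directory layout and paths are assumptions.

```python
from keras.preprocessing.image import ImageDataGenerator

# Three on-the-fly augmentation strategies; all transformation ranges below
# are placeholders standing in for the values of Table 2.
augmentation_strategies = {
    "none": ImageDataGenerator(rescale=1.0 / 255),
    "typical": ImageDataGenerator(
        rescale=1.0 / 255,
        rotation_range=5,
        width_shift_range=0.05,
        height_shift_range=0.05,
        zoom_range=0.05,
    ),
    "aggressive": ImageDataGenerator(
        rescale=1.0 / 255,
        rotation_range=20,
        width_shift_range=0.2,
        height_shift_range=0.2,
        zoom_range=0.2,
        shear_range=10,
        brightness_range=(0.7, 1.3),
    ),
}

# Stream batches from a class-per-folder layout (paths and layout assumed).
train_generator = augmentation_strategies["typical"].flow_from_directory(
    "data/train", target_size=(224, 224), color_mode="rgb",
    class_mode="categorical", batch_size=64, shuffle=True, seed=42)
val_generator = augmentation_strategies["none"].flow_from_directory(
    "data/validation", target_size=(224, 224), color_mode="rgb",
    class_mode="categorical", batch_size=64, shuffle=False)
```

In this sketch, horizontal flips are deliberately omitted, since mirroring would swap left- and right-sided anatomical classes.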

The final layer of every model was a softmax-activated dense layer with a width equal to the number of classes (22), so that the model output could be expressed as per-class prediction confidences summing to 100%. The class with the highest confidence was the model's top choice for the evaluation of top-1 accuracy.

All models were initialized with default Keras parameters and trained with categorical cross-entropy as the loss function, the Adam optimizer [22], a batch size of 64, and a 50% learning rate reduction on validation top-1 accuracy plateaus. Training lasted up to 100 epochs, with early stopping if validation top-1 accuracy or loss stopped improving for a set number of epochs. ReLUs [23] were used as activation functions. All other hyperparameters were kept constant.
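A sketch of this common training configuration is given below; the patience values, the monitored quantities and the generator variables (from the augmentation sketch above) are assumptions.

```python
from keras import optimizers
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

model = build_advanced_model("MobileNetV2")  # any of the models sketched above

model.compile(
    optimizer=optimizers.Adam(),        # default Keras parameters
    loss="categorical_crossentropy",
    metrics=["accuracy"],               # top-1 accuracy
)

callbacks = [
    # halve the learning rate on validation top-1 accuracy plateaus
    # ("val_acc" in Keras 2.2.4; "val_accuracy" in tf.keras 2.x)
    ReduceLROnPlateau(monitor="val_acc", factor=0.5, patience=5),
    # stop early when validation loss stops improving (patience assumed)
    EarlyStopping(monitor="val_loss", patience=15, restore_best_weights=True),
]

history = model.fit_generator(
    train_generator,                    # batch size of 64 set in the generator
    epochs=100,
    validation_data=val_generator,
    callbacks=callbacks,
)
```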

Input resolution was 224 × 224 pixels in grayscale, or RGB for models requiring three-channel input. Where available, the corresponding Keras preprocessing function was used; otherwise, pixel values were rescaled to the 0–1 range.
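The preprocessing choice could be wired into the generators as follows; the helper below is purely illustrative.

```python
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import preprocess_input as vgg16_preprocess
from keras.applications.mobilenet_v2 import preprocess_input as mobilenet_preprocess
from keras.applications.inception_resnet_v2 import preprocess_input as irv2_preprocess

# Architecture-specific Keras preprocessing functions; models without one
# fall back to simple 0-1 rescaling.
PREPROCESSORS = {
    "VGG16": vgg16_preprocess,
    "MobileNetV2": mobilenet_preprocess,
    "InceptionResNetV2": irv2_preprocess,
}

def make_generator(arch=None, **augmentation_kwargs):
    """Illustrative helper returning an ImageDataGenerator that applies the
    architecture-specific preprocessing when one exists."""
    if arch in PREPROCESSORS:
        return ImageDataGenerator(preprocessing_function=PREPROCESSORS[arch],
                                  **augmentation_kwargs)
    return ImageDataGenerator(rescale=1.0 / 255, **augmentation_kwargs)
```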

Reproducibility and fair comparisons among models were facilitated by a common seed value for all pseudo-random number generators.
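A sketch of this seeding step is shown below; the actual seed value is an assumption.

```python
import os
import random
import numpy as np
import tensorflow as tf

SEED = 42  # assumed; the seed value used in the study is not reported

os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # TensorFlow 1.x API; tf.random.set_seed in TF 2.x
```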

Class imbalance was mitigated using a weighted version of the loss function based on class weights calculated on the validation subset.
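A sketch of the class weight calculation using scikit-learn's compute_class_weight; the label array is a placeholder and the mapping to the generator's class indices is assumed.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# labels: integer class labels of the subset used for the calculation
# (the validation subset, per the text); placeholder array shown here.
labels = np.array([0, 0, 0, 1, 2, 2, 21])

classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes.tolist(), weights))  # index correspondence assumed

# Passed to training as model.fit_generator(..., class_weight=class_weights),
# scaling each class's contribution to the categorical cross-entropy loss.
```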

Model evaluation

Evaluation was performed on a separate test dataset not involved in model training, consisting of 261 intraoral radiographs balanced for both sexes and all five age groups described in the NHANES [24], to reduce dataset bias. Its size allows the detection of large discrepancies in accuracy between the test and validation subsets without appreciably reducing the data available for training.

Total training time, per-sample prediction time, loss function minimization, top-1 accuracy, precision, recall, F1-score and area under the curve (AUC) were obtained with Keras and Scikit-Learn.
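A sketch of how these metrics could be obtained with scikit-learn for the 261-image test set; the test generator and its ordering are assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

# test_generator is assumed to be built with shuffle=False so that .classes
# aligns with the prediction order.
y_prob = model.predict_generator(test_generator)   # shape (261, 22) softmax outputs
y_true = test_generator.classes                    # integer ground-truth labels
y_pred = np.argmax(y_prob, axis=1)                 # top-1 class choices

top1_accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
```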

Metrics were macro-averaged over all classes, where applicable. Since some classes were considerably underrepresented, in-depth per-class performance analysis was deemed out of scope.

Top-1 class predictions for all test images were dichotomized into true or false predictions against the ground truth in a one-vs-all fashion and then used for statistical analysis.

A significance level of 0.05 was set for all statistical tests. Comparison of the proportions of misclassifications across all models was performed with Cochran's Q test. Pairwise comparisons across models were performed with McNemar’s test. P values were adjusted with the Bonferroni correction [25,26,27].
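For illustration, an equivalent analysis could be expressed in Python with statsmodels (the study itself used SPSS and R for these tests); the binary correctness matrix follows the dichotomization described above, while the placeholder data and the asymptotic McNemar variant are assumptions.

```python
import numpy as np
from itertools import combinations
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.multitest import multipletests

# correct: (n_test_images, n_models) binary matrix, 1 where a model's top-1
# prediction matched the ground truth; random placeholder data shown here.
correct = np.random.randint(0, 2, size=(261, 36))

# Cochran's Q test across all models
q_result = cochrans_q(correct)
print("Q =", q_result.statistic, "p =", q_result.pvalue)

# Pairwise McNemar tests (asymptotic variant assumed) with Bonferroni adjustment
pairs, pvals = [], []
for i, j in combinations(range(correct.shape[1]), 2):
    table = np.zeros((2, 2))
    for a in (0, 1):
        for b in (0, 1):
            table[a, b] = np.sum((correct[:, i] == a) & (correct[:, j] == b))
    pvals.append(mcnemar(table, exact=False, correction=True).pvalue)
    pairs.append((i, j))
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
```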

Statistical tests were performed with the IBM SPSS statistical package version 25, p value adjustments with the R statistical package version 3.6.1, and ROC curves were calculated with Scikit-Learn version 0.22.

Results

Dataset generation

Out of a total of 17,781 images, 15,254 were accepted for further processing, resulting in a training subset of 12,213 images and a validation subset of 3041 images. Due to prior randomization and anonymization, the exact number of subjects included in the study is unknown. Class weights ranged from 1 to 26.65. A full dataset description is given in Table 3.

Table 3 Per anatomic class sample populations and weights. RBW: right bite-wing, LBW: left bite-wing

A total of 36 models were trained and evaluated. Summaries of model training history and model performance across the test dataset are presented in Table 4. Learning curves for each model are found in “Appendix of ESM”.

Table 4 Training history and model performance evaluation on the test dataset. AUC: area under the curve

Comparison among all models

Cochran's Q test indicates a statistically significant difference across the 36 models in terms of the proportion of misclassifications on the test dataset, χ2(35) = 3034.949, p < 0.001.

Post hoc analysis—pairwise comparisons

An overview of the significance levels of the Bonferroni-adjusted p values for pairwise model comparisons with McNemar’s test is shown in Fig. 1. A complete table of p values is found in the Appendix.

Fig. 1 Significance levels of multiple pairwise comparisons among models with McNemar's test, adjusted with the Bonferroni correction. NS: non-significant

Failed models

MLPs without batch normalization (models 0, 1, 2) failed to train (top-1 accuracy 0.05–0.10, F1-score 0–0.01 and AUC equal to 0.50). Therefore, a statistically significant difference from any other model (p < 0.001 for all pairs) was to be expected, except among members of the same group.

Comparisons among advanced models

No statistically significant differences were observed among advanced group models (18–35), regardless of bridging layer or data augmentation strategy (p > 0.05 for all pairs). Advanced models had a top-1 accuracy of 0.81–0.89, an F1-score of 0.71–0.86 and an AUC of 0.86–0.94.

Comparison of baseline models against advanced models

Baseline models 3, 4, 6, 7, 12 and 13 showed no statistically significant differences from the advanced group, except for pairs 4–18 (p < 0.05), 4–22 (p < 0.05) and 4–31 (p < 0.05), indicating comparably high performance (top-1 accuracy 0.77–0.87, F1-score 0.70–0.81 and AUC 0.84–0.90).

In contrast, models 5, 10, 11 and 17 showed statistically significant differences from all models in the advanced group (p < 0.001 for all pairs), as a result of poor performance (top-1 accuracy 0.46–0.61, F1-score 0.33–0.48, AUC 0.67–0.75).

Models 14 and 16 showed statistically significant differences from the advanced group except model 25 (p < 0.001 against models 18, 21, 22, 26–35, p < 0.01 against models 19, 23, 24 and p < 0.05 against model 20). Their top-1 accuracy was 0.69 and 0.68 and their F1-scores 0.60 and 0.61, respectively, with an AUC of 0.81 for both.

Comparisons among baseline models

Models 11 and 17 (top-1 accuracy 0.46, F1-score 0.33, AUC 0.67 and top-1 accuracy 0.47, F1-score 0.36, AUC 0.68, respectively) had the lowest performance of all models (except models 0, 1, 2) and showed statistically significant differences from every other baseline model (p < 0.001, except pairs 11–10 and 17–10, where p < 0.01, and 11–5 and 17–5, where p < 0.05).

Models 5 and 10 also showed statistically significant differences from the other baseline models (p < 0.001 with models 3, 4, 6, 7, 12, 13 and p < 0.05 with models 8, 11, 15, 17), except for pairs 5–9, 5–10, 5–14, 5–16, 10–9, 10–14 and 10–16, where no statistically significant difference was observed, together with low performance (top-1 accuracy 0.61, F1-score 0.47, AUC 0.74 and top-1 accuracy 0.61, F1-score 0.48, AUC 0.75, respectively).

Models 8, 9, 14, 15 and 16 (top-1 accuracy 0.68–0.75, F1-score 0.58–0.67, AUC 0.79–0.84) showed statistically significant differences in the proportion of errors against models 11 and 17 (p < 0.001) and the highest-performing baseline model 7 (p < 0.001 with models 9, 14, 16 and p < 0.01 with models 8, 15). In addition, model 8 showed a statistically significant difference with models 5 and 10 (p < 0.05). Models 14, 15 and 16 also showed significant differences in the pairs 14–12 (p < 0.01), 14–13 (p < 0.01), 15–5 (p < 0.05), 15–10 (p < 0.05), 16–12 (p < 0.001) and 16–13 (p < 0.05). This suggests that this subset of models lies in the middle of the baseline group in terms of performance.

Baseline models 3, 4, 6, 12 and 13 (top-1 accuracy 0.77–0.84, F1-score 0.70–0.79, AUC 0.84–0.90) showed statistically significant differences from models 5, 10, 11 and 17 (p < 0.001, except pair 4–10, where p < 0.01), as well as in pairs 4–7 (p < 0.05), 12–14 (p < 0.01), 12–16 (p < 0.001), 13–14 (p < 0.01) and 13–16 (p < 0.05), while no statistically significant differences were observed among them.

Baseline model 7 was the highest-performing model of the baseline group (top-1 accuracy 0.87, F1-score 0.81, AUC 0.90). It showed a statistically significant difference from every other model in the baseline group (p < 0.001 with models 5, 9, 10, 11, 14, 16, 17, p < 0.01 with models 8, 15 and p < 0.05 with model 4), except for the high-performing baseline models 3, 6, 12 and 13, with which no statistically significant difference was observed.

Discussion

This study investigated whether neural networks could classify the anatomical region of intraoral radiographs based solely on image data, and the influence of their architectural elements. According to our findings, this is feasible, with an expected top-1 accuracy of 80–90%, even when models are trained on small datasets.

Intraoral images usually depict very similar anatomical features, especially when they are part of a series from the same patient, where a significant amount of overlap is expected. Therefore, an understanding of the relative positioning of the depicted structures is essential for their classification. In contrast, deep learning classifiers have mostly been studied on images of discrete objects against different backgrounds. With that in mind, testing a multitude of architectures was deemed appropriate, as some of them could fail to address this unique challenge.

Multilayer perceptrons (MLPs)

MLPs without normalization failed to train. However, the introduction of batch normalization [19] allowed training comparable to that of advanced models. This finding indicates that input and layer normalization may allow non-convolutional architectures to perform adequately in radiographic image classification tasks. However, these models performed poorly when data augmentation was applied, indicating a dependence on input images with little variation.

Baseline convolutional models

Most baseline convolutional models achieved adequate performance. Batch normalization [19] improved training and performance in some architectures. A typical data augmentation strategy resulted in better training, but the introduction of an aggressive strategy led to deteriorating performance.

Advanced models

All advanced models had comparable performance and outperformed most baselines. However, due to the strict nature of the Bonferroni p value adjustment, subtle differences may not have been elucidated, and each model may make different errors on the test dataset. Furthermore, some models trained irregularly, as is evident from their learning curves.

Data augmentation

Data augmentation is a common technique for generating more samples from small datasets, and it is considered vital to prevent overfitting, especially in models with large capacities. It can also function as a measure of a model's resilience to improper input, as represented by the aggressive strategy.

Baseline model performance was severely downgraded with aggressive data augmentation, although some were robust or even benefited from subtle transformations. Most advanced group models were resistant to aggressive data augmentation and capable of managing degraded images.

Global average pooling

Using a global average pooling layer [20] in baseline models significantly limited their performance, a finding directly associated with the resulting marked reduction in model capacity.

On the other hand, parameter reduction seemed to favor the advanced group, where it showed no detrimental effects even when applied concurrently with aggressive data augmentation. A regularization effect could also be observed in many models’ learning curves.

Clinical recommendations

Based on the above, we recommend models from the advanced group, trained with aggressive data augmentation and with a global average pooling layer [20] as the bridge between the feature extractor and the classifier.

Choosing an architecture should be a compromise among other parameters, such as available resources and time. In a clinical setting, where long waiting times can be a major disadvantage, a smaller and faster model such as MobileNetV2 might be more useful, while the Inception-ResNetV2-based architecture could be better suited for tasks such as database maintenance.

Limitations

Our models exhibit all the limitations inherent in most deep learning models. Their architectures were developed empirically for natural images, which prior studies have shown to differ from X-ray images in terms of their characteristics [28]; they lack a theoretical explanation of their performance and have high computational costs that do not favor experimentation. They also lack outcome justification and are vulnerable to specific images containing irrelevant features able to trigger a predictable output (adversarial samples), making trusted input a necessity.

Such models are generally assumed to require large amounts of data to train. Building large medical datasets is a laborious undertaking with serious ethical, legal and financial considerations, while for smaller datasets containing well-structured images, solutions other than neural networks may be more efficient. However, in this study good performance was achieved with limited data.

In addition to the aforementioned, a significant limitation is the introduction of dataset bias into model outputs. Sources of dataset bias in this study were the use of only high-quality, diagnostic radiographs from a single modality, which mostly depicted teeth or prostheses (and rarely depicted exclusively bone structures or contained instruments). Equally important is the fact that these models can only replicate the classification criteria applied by the evaluator (evaluator bias). The above could significantly limit our models from performing consistently under different conditions. Training with a multitude of diverse datasets could partially resolve this issue. Currently, model generalizability to datasets with different specifications is not guaranteed.

Another limitation is class imbalance, a direct result of uneven clinical demand and the retrospective nature of the study. Creating a perfectly balanced dataset would require either excluding a large portion of our dataset, with a possible decline in performance, or a prospective study design, exposing subjects to X-rays for study purposes and breaching the ALARA principle without clearly determined benefits. In view of the above, using a weighted loss function seems to be an appropriate compromise. However, all models exhibited their worst performance in the underrepresented classes.

Recommendations for further research

Further gains in performance are anticipated from training with a larger and more balanced dataset, from loss functions that take into account the inherent order of anatomical regions, from transfer learning and fine-tuning, and from the combined use of model ensembles.