Introduction

Dental periapical radiography is a type of radiographic imaging that is essential for the diagnosis of dental hard tissue diseases, such as decay, periapical periodontitis, and periodontitis [1,2,3]. Based on the bisecting technique, a dental periapical radiograph can capture several (usually 3–4) intact teeth and their periodontal structures. However, the clarity of the radiograph relies strongly on the operator’s technique, including the choice of projection angle, radiation time, and dose [4]. Dental periapical radiographs are produced in abundance in daily clinical practice and are read by dentists to generate diagnosis reports, provide guidance for treatment plans, or evaluate treatment outcomes. This reading and analysis work demands considerable mental effort and takes a substantial amount of the dentist’s time, and radiographic interpretations tend to vary between observers [3, 5, 6]. Moreover, non-endodontic lesions are sometimes misdiagnosed as periapical periodontitis lesions [7]. Developing an auxiliary diagnosis system for dental periapical radiographs therefore holds great promise.

A variety of algorithms have been applied to dental X-ray image processing, feature extraction, and segmentation, such as contour extraction [8, 9], adaptive thresholding [10,11,12], iterative thresholding [13], level sets [14, 15], mathematical morphology [16], Fourier descriptors [17], hierarchical contour matching [18], weighted Hausdorff distance [19], texture statistics techniques [20], local singularity analysis [21], semi-supervised fuzzy clustering algorithms [22, 23], and neutrosophic orthogonal matrices [24]. However, these algorithms extract only shallow image features of the target objects and rely heavily on manual definitions, which may introduce human error.

With the development of artificial intelligence techniques, support vector machines (SVMs) [25] and artificial neural networks [26,27,28,29] were introduced to detect lesions on dental images, achieving better results than “rule-based” computer algorithms. Li et al. [25] used principal component analysis to extract pathological characteristics from clinical images to train an SVM classifier, achieving automatic and fast clinical segmentations. Yu et al. [26] proposed a three-layer neural network in which normalized autocorrelation coefficients were treated as input features, and the back-propagation algorithm was used to learn the classifier weights that distinguish decayed teeth from normal ones. El-Bakry et al. [27] trained a neural network to classify sub-images as containing dental disease or not and constructed a fast detection algorithm by performing cross-correlation in the frequency domain between the input image and the input weights of the neural network. Tumbelaka et al. [28] used a local image differentiation technique to extract edges as basic image features, analyzed them with texture descriptors to obtain image entropy, and fed the result to artificial neural networks to detect infected regions. However, in these studies, image features had to be manually defined and pre-computed before being passed to classifiers such as SVMs or neural networks.

In recent years, deep convolutional neural networks (CNNs) have been developed that take raw image data as input and perform well on image classification and object detection [30]. Srivastava et al. [29] constructed a deep fully convolutional neural network to mark caries on bitewing radiographs, with a reported precision of 61.5 and recall of 81.5. Al Kheraif et al. [31] performed teeth and bone segmentation on panoramic radiographs using a hybrid graph-cut technique and a convolutional neural network. Lee et al. [32] implemented decay classification based on the GoogLeNet Inception v3 CNN, although the teeth had to be manually segmented before being sent to the network, and the positions of the decay lesions were not precisely located in their research. Ekert et al. [33] applied deep CNNs to detect apical lesions on panoramic dental radiographs, but the CNN’s sensitivity needs to be improved before clinical application. Later, regions with convolutional neural network features (R-CNN) [34] were developed to solve object detection tasks, in which target objects (regions of interest) are automatically boxed out and annotated with labels. R-CNN was then upgraded to fast R-CNN [35] and further to faster R-CNN [36], with higher efficiency and better performance. Although some researchers have reported teeth segmentation on panoramic dental radiographs based on faster R-CNN [37], few studies have applied R-CNNs to disease lesion detection on dental periapical radiographs.

In this research, faster R-CNN was utilized to detect decay, periapical periodontitis, and periodontitis in dental periapical radiographs. The influences of network training strategies, as well as disease categories and severity levels, on the detection outcomes were investigated, with the intention of determining how well faster R-CNN performs in disease detection on dental X-rays and of identifying which diseases and severity levels are detected with higher accuracy, i.e., the indications of this deep CNN auxiliary diagnosis method.

Methods

Data collection and annotation

In total, 2900 digital dental periapical radiographs were collected. The inclusion criteria were (1) periapical radiographs with permanent teeth, (2) proper radiation exposure, and (3) proper position and axial direction of the teeth. The exclusion criteria were (1) radiographs containing deciduous teeth, (2) images too bright or too dark to distinguish the lesions, and (3) images in which the teeth were severely distorted. Each digital radiograph was exported at a resolution of 96 dpi with a size of approximately (300–500) × (300–400) pixels and saved as a “JPG” image file with a unique identification code. These image files were collected anonymously to ensure that no private information, such as patient name, gender, or age, was revealed. Afterward, an expert dentist with more than 5 years of clinical experience drew minimum bounding boxes to frame each diseased area of decay, periapical periodontitis (labeled periapi for short), and periodontitis with bone resorption (labeled periodo for short). Each type of disease was graded into three levels of severity: mild, moderate, and severe. Thus, a total of nine label names were annotated for the bounding boxes: decay-mild, decay-moderate, decay-severe, periapi-mild, periapi-moderate, periapi-severe, periodo-mild, periodo-moderate, and periodo-severe. The grading criteria were as follows (a schematic encoding of these rules is sketched after the list):

  • Decay-mild: the decay invasion depth is less than 1/3 of the tooth sidewall or roof width;

  • Decay-moderate: the decay invasion depth is between 1/3 and 1/2 of the tooth sidewall or roof width;

  • Decay-severe: the decay invasion depth is larger than 1/2 of the tooth sidewall or roof width;

  • Periapi-mild: the width of the periapical periodontitis area (the minor axis) is less than 1 mm;

  • Periapi-moderate: the width of the periapical periodontitis area (the minor axis) is between 1 and 3 mm;

  • Periapi-severe: the width of the periapical periodontitis area (the minor axis) is larger than 3 mm;

  • Periodo-mild: the bone resorption depth is less than 1/3 of the tooth root length;

  • Periodo-moderate: the bone resorption depth is between 1/3 and 1/2 of the tooth root length;

  • Periodo-severe: the bone resorption depth is larger than 1/2 of the tooth root length.
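As a rough illustration only, the grading rules above can be encoded as a small function; the measurement arguments below (invasion-depth ratio, minor-axis width in millimetres, resorption-depth ratio) are hypothetical inputs standing in for the dentist's reading of the radiograph, not values extracted by the networks.

```python
def grade_lesion(disease, measurement):
    """Map a measured value to one of the nine annotation label names.

    disease: 'decay', 'periapi', or 'periodo' (hypothetical identifiers)
    measurement: decay   -> invasion depth as a fraction of sidewall/roof width
                 periapi -> lesion width (minor axis) in millimetres
                 periodo -> bone resorption depth as a fraction of root length
    """
    if disease in ("decay", "periodo"):
        # Both graded by fractional depth with 1/3 and 1/2 cut-offs.
        level = "mild" if measurement < 1/3 else "moderate" if measurement <= 1/2 else "severe"
    elif disease == "periapi":
        # Graded by minor-axis width with 1 mm and 3 mm cut-offs.
        level = "mild" if measurement < 1 else "moderate" if measurement <= 3 else "severe"
    else:
        raise ValueError(f"unknown disease: {disease}")
    return f"{disease}-{level}"

# A decay lesion reaching 40% of the sidewall width would be labeled 'decay-moderate'.
print(grade_lesion("decay", 0.4))
```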

Image coordinates were expressed as pixel distances from the top-left corner of the image, so each bounding box could be recorded by its top-left and bottom-right corner points (xmin, ymin, xmax, ymax).
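A single annotation can then be stored as a record pairing the image file with a label name and the pixel coordinates of the bounding box; the field names and values below are illustrative, not the exact annotation format used in this study.

```python
# Hypothetical annotation record; coordinates are pixel distances from the
# top-left corner of the image, as described above.
annotation = {
    "filename": "a1b2c3d4.jpg",    # anonymized identification code (example)
    "label": "periapi-moderate",   # one of the nine label names
    "xmin": 112, "ymin": 205,      # top-left corner of the bounding box
    "xmax": 168, "ymax": 261,      # bottom-right corner of the bounding box
}
```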

Training and validation of faster R-CNNs

An object detection tool package [38] based on TensorFlow was utilized to construct the faster R-CNN, which is one of the state-of-the-art object detectors for multiple categories. The training process was executed on a GPU (Quadro RTX 8000, NVIDIA, USA) with 48 GB of memory and 4608 CUDA cores. The algorithms ran on TensorFlow version 1.13.1 under Ubuntu 18.04. The training parameters were configured as follows: anchor scales [0.1, 0.2, 0.4, 0.8, 1.6], 100,000 iterations, and an initial learning rate of 0.003, reduced to 0.0003 after 30,000 iterations and further to 0.00003 after 60,000 iterations. A model pre-trained on the COCO dataset (version 2018-01-28) was loaded as the fine-tuning checkpoint.
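For illustration, the piecewise learning-rate schedule described above can be expressed with the TensorFlow 1.x API roughly as follows; in the study this schedule was set through the object detection package's pipeline configuration rather than written by hand.

```python
import tensorflow as tf  # TensorFlow 1.13, as used in this study

global_step = tf.train.get_or_create_global_step()

# 0.003 for the first 30,000 iterations, 0.0003 until 60,000 iterations,
# and 0.00003 thereafter.
learning_rate = tf.train.piecewise_constant(
    global_step,
    boundaries=[30000, 60000],
    values=[0.003, 0.0003, 0.00003])
```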

The faster R-CNN was trained and validated under several strategies with different organizations of the annotated data (Table 1). First, all annotated images with the nine label names were treated as ground truth and used to train and validate a faster R-CNN network as the baseline. Second, the level attribute of each bounding box (bbox) annotation was ignored; that is, decay-mild, decay-moderate, and decay-severe were all relabeled as decay, and likewise for periapi and periodo, to train and validate another faster R-CNN named Net A. After that, three faster R-CNNs (named Net B1, B2, and B3) were trained, one for each of the three disease categories. Similarly, Nets C1, C2, and C3 were trained, one for each severity level of bboxes. Fivefold cross-validation was applied for every network listed in Table 1: all included images were randomly and evenly divided into five parts, and training and validation were run for five rounds, with each part used once as the validation dataset and the remaining four parts as the training dataset. Before each round, the trainable parameters were re-initialized so that the network “forgot” what had been learned previously and was “renewed.”
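The fivefold split can be sketched as follows; scikit-learn's KFold is used here purely for illustration, and image_files is a hypothetical list of the annotated radiographs.

```python
from sklearn.model_selection import KFold

# Hypothetical list of annotated radiograph file names.
image_files = [f"{i:04d}.jpg" for i in range(2900)]

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(image_files)):
    train_set = [image_files[i] for i in train_idx]  # four parts for training
    val_set = [image_files[i] for i in val_idx]      # one part for validation
    # A freshly (re-)initialized faster R-CNN would be trained on train_set
    # and evaluated on val_set in each of the five rounds.
    print(f"fold {fold}: {len(train_set)} training / {len(val_set)} validation images")
```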

Table 1 Training and validation strategies

Several metrics were calculated on the validation dataset for each label name, including intersection over union (IoU), precision, recall, and average precision (AP, equal to the area under the precision–recall curve). Metrics from the fivefold cross-validation procedure were averaged to obtain mean values.

First, the predicted bboxes were compared with the ground truth bboxes; the IoU is defined as:

$$\hbox{IoU}=\frac{\hbox{Area}_{\rm pred}\cap \hbox{Area}_{\rm gt}}{\hbox{Area}_{\rm pred}\cup \hbox{Area}_{\rm gt}}$$
(1)

where \(\hbox{Area}_{\rm pred}\) and \(\hbox{Area}_{\rm gt}\) represent the areas of the predicted bbox and its corresponding ground truth bbox. The IoU threshold was set to 0.5; that is, a predicted bbox whose IoU with its corresponding ground truth bbox exceeded 0.5 was counted as a true positive.
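For axis-aligned bounding boxes stored as (xmin, ymin, xmax, ymax), Eq. (1) reduces to a few lines of code; the function below is a generic sketch, not the evaluation routine of the tool package.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A predicted bbox counts as a true positive when its IoU with the
# corresponding ground truth bbox exceeds 0.5.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14, below the threshold
```

Precision and recall could then be calculated: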

$$\hbox{Precision}=\frac{\hbox{TP}}{\hbox{Pred}}$$
(2)
$$\hbox{Recall}=\frac{\hbox{TP}}{\hbox{GT}}$$
(3)

where TP is the count of true positive bboxes, Pred the count of predicted bboxes, and GT the count of ground truth bboxes. Precision as defined here is equivalent to the positive predictive value in clinical diagnosis, and recall is equivalent to sensitivity.

Receiver operating characteristic (ROC) curves are an important metric for evaluating diagnostic tools. However, computing an ROC curve requires counts of true positive, true negative, false positive, and false negative samples, and true negatives are not applicable in this kind of object detection task, because no bounding boxes are drawn on negative targets (areas without disease lesions) in the ground truth. Thus, ROC curves could not be drawn. Instead, we calculated precision–recall (P–R) curves in this study, which are closely related to ROC curves; both can evaluate the agreement between a test and a reference [40]. The area under the P–R curve was calculated as the average precision (AP), a metric widely used to evaluate network performance in object detection tasks. For each label name, the faster R-CNN predicted bboxes with confidence scores, and a confidence threshold determined which predicted bboxes were finally output. When the confidence score of one predicted bbox was taken as the threshold, the predicted bboxes whose confidence scores exceeded that threshold were output and matched against the ground truth bboxes to produce one precision value and one recall value. Sweeping the threshold over every confidence score yielded a series of precision–recall pairs that form the P–R curve. The average precision (AP) [39] is then defined as the area under the smoothed P–R curve:

$$\hbox{AP}=\sum_{n} ({r}_{n+1}-{r}_{n}){p}_{{\rm interp}}({r}_{n+1})$$
(4)

where \({p}_{{\rm interp}}(r)\) is the maximum precision at any recall value greater than or equal to \(r\):

$${p}_{{\rm interp}}\left({r}_{n+1}\right)=\underset{\tilde{r}\ge {r}_{n+1}}{\max}\,p(\tilde{r}).$$
(5)
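Given the confidence score of each predicted bbox, a flag marking whether it matched a ground truth bbox (IoU > 0.5), and the total number of ground truth bboxes, Eqs. (4) and (5) can be implemented as below; this is a generic sketch of all-point interpolated AP, not the exact evaluation script of the tool package.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """All-point interpolated average precision (Eqs. 4 and 5)."""
    order = np.argsort(scores)[::-1]                 # sweep thresholds high to low
    tp_cum = np.cumsum(np.asarray(is_tp)[order])
    precision = tp_cum / np.arange(1, len(order) + 1)
    recall = tp_cum / n_gt

    # p_interp(r): maximum precision at any recall >= r (Eq. 5).
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]

    # Area under the smoothed P-R curve (Eq. 4).
    r = np.concatenate(([0.0], recall))
    return np.sum((r[1:] - r[:-1]) * p_interp)

# Toy example with hypothetical values: 4 predictions, 3 ground truth bboxes.
print(average_precision(scores=[0.9, 0.8, 0.6, 0.3], is_tp=[1, 0, 1, 1], n_gt=3))
```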

Statistical analysis

To evaluate the performance of each network across diseases and levels, analysis of variance (ANOVA) was used. First, the performance of baseline was compared with that of Net A across all three diseases. Since the bboxes output by Net A carry no level attribute, the level attributes of the bboxes predicted by baseline were ignored for this comparison; that is, decay-mild, decay-moderate, and decay-severe were all treated as decay, and likewise for periapi and periodo. Two-way ANOVA was applied with strategy and disease as independent variables and the metrics calculated on the validation dataset (IoU, precision, recall, and AP) as dependent variables. Second, the performance of baseline was compared with those of Net B (composed of B1, B2, and B3) and Net C (composed of C1, C2, and C3). Multi-way ANOVA was applied, with strategy, disease, and level as independent variables and the same validation metrics (IoU, precision, recall, and AP) as dependent variables.

A statistical software program (IBM SPSS Statistics, v19.0; IBM Corp) was used for the statistical analysis. Where the interactions of independent variables were significant, simple effects were analyzed, with pairwise comparisons adjusted by the Bonferroni method (α = 0.05 for all tests).
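As an illustrative alternative to the SPSS workflow, the two-way design of the first comparison could be expressed in Python with statsmodels as sketched below; the data frame columns and values are hypothetical placeholders, not the study's results.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format table: one row per fold x strategy x disease,
# holding the AP value calculated on that fold's validation dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "strategy": ["baseline", "netA"] * 15,
    "disease":  ["decay"] * 10 + ["periapi"] * 10 + ["periodo"] * 10,
    "AP":       rng.random(30),  # placeholder values
})

# Two-way ANOVA with interaction, as in the baseline vs. Net A comparison.
model = ols("AP ~ C(strategy) * C(disease)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```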

Results

As shown in Fig. 1, despite some missed diagnoses, the diseases detected by the faster R-CNNs were basically close to the ground truth. The performance of baseline was compared with that of Net A, and the metrics calculated on the validation dataset are shown in Table 2. Two-way ANOVA results (Table 3) showed that strategy had no significant influence on any metric except AP, whereas disease had a significant influence on all metrics. The interaction of strategy and disease had no significant influence on any metric except precision. Non-significant interactions were removed from the analysis model, and the \(F\) and \(P\) values were re-estimated and shown in brackets. Otherwise, where interactions between factors were significant, simple effects were analyzed based on estimated marginal means. Factors with a significant influence on the corresponding metrics were forwarded to pairwise comparisons adjusted by the Bonferroni method. The values of the metrics and the results of the pairwise comparisons are illustrated in Fig. 2, where values with significant differences are annotated with different letters. Comparing diseases, decay tended to be predicted with higher IoU, precision, and AP than periapi or periodo, while periodo tended to be predicted with higher recall (Fig. 2a). Comparing strategies (networks), Net A performed as well as baseline and even slightly better than baseline in terms of the AP value for the prediction of periapi (Fig. 2b).

Fig. 1

Samples of dental periapical radiographs with lesions detected by the networks constructed and trained in this research. Ground truth consisted of the manual annotations of an expert dentist. The detections shown for baseline were output by the neural network trained with all disease categories and all severity levels. The detections shown for Net A were output by the neural network trained with all disease categories, ignoring severity levels. The detections shown for Net B combine the detections of Net B1, Net B2, and Net B3, where decay was detected by Net B1, periapical periodontitis by Net B2, and periodontitis by Net B3. The detections shown for Net C combine the detections of Net C1, Net C2, and Net C3, where mild-level diseases were detected by Net C1, moderate-level diseases by Net C2, and severe-level diseases by Net C3

Table 2 Metrics calculated for baseline and Net A on validation dataset by fivefold cross-validation method (mean ± SD)
Table 3 Two-way ANOVA results of comparison of baseline and Net A
Fig. 2

Metrics calculated for baseline and Net A; values with significant differences were annotated by different letters. a Comparison between diseases; b comparison between strategies

The performance of baseline was also compared with those of Net B and Net C, and the metrics calculated on the validation dataset are shown in Table 4. Multi-way ANOVA results (Table 5) show that all independent variables, as well as their interactions, had significant influences on all metrics. Simple effects were analyzed, and further pairwise comparisons were performed based on estimated marginal means and adjusted by the Bonferroni method. The values of the metrics and the results of the pairwise comparisons are illustrated in Figs. 3, 4, and 5. Comparing diseases at the severe level, decay tended to be predicted with higher precision, recall, and AP values than periapi, and periapi higher than periodo, but this order was reversed at the mild and moderate levels (Fig. 3). Mild decay tended to be predicted with lower IoU than mild periapi and mild periodo (Fig. 3). Comparing levels, the severe level tended to be predicted with higher precision, recall, and AP values than the moderate level, and the moderate level higher than the mild level, particularly for decay and periapi (Fig. 4). Comparing strategies (networks), Net B and Net C performed better than baseline in certain circumstances, but Net C failed to predict mild decay (Fig. 5).

Table 4 Metrics calculated for baseline and Net B, C on validation dataset by fivefold cross-validation method (mean ± SD)
Table 5 Multi-way ANOVA results of comparison of baseline and Net B, C
Fig. 3

Comparison across diseases of metrics calculated for baseline and Net B, C; values with significant differences were annotated by different letters

Fig. 4

Comparison across levels of metrics calculated for baseline and Net B, C; values with significant differences were annotated by different letters

Fig. 5

Comparison across strategies of metrics calculated for baseline and Net B, C; values with significant differences were annotated by different letters

Discussion

The metrics included in this research have clinical significance. In clinical use, the overlap between predicted disease areas and the ground truth must reach a certain level for the predictions to help the dentist locate the potential disease. The IoU, defined as the overlapping area over the union area, measures how precisely the target disease is localized. The larger the IoU, the more precise the location of the target disease; the predicted target completely overlaps the target disease when the IoU reaches 1. In our results, the IoU values were consistently high. However, a high IoU is not directly related to the diagnostic performance in predicting either the presence/absence or the severity of target diseases. Precision as defined in this research is equivalent to the positive predictive value in clinical diagnosis, and recall is equivalent to sensitivity. Thus, precision here represents the chance that a predicted disease bbox truly contains the disease, and recall represents the probability that the system will indicate “disease” for those areas that truly have the disease. AP, defined as the area under the precision–recall curve, tests the overall performance of the network; the closer the AP is to 1, the better the network model. Moreover, precision–recall curves are closely related to receiver operating characteristic curves [40], and both can evaluate the agreement between a test and a reference. However, when dealing with highly skewed datasets where the class distribution is uneven, precision–recall (P–R) curves give a more informative picture of an algorithm's performance [40]. Precision, recall, and average precision did not reach high values, which reflects the difficulty of correctly detecting dental lesions (Tables 2, 4).

As shown in the results, many values of precision, recall, and AP were less than 0.5, which is the chance level for a two-category classification. However, the disease detection task here is not only to classify multicategory disease lesions but also to locate the actual position and size of each lesion, and the reported performance reflects this combined task. Consider a small target area such as a mild decay lesion on a dental radiograph: if its position and size were determined randomly, the chance of correctly matching the ground truth (IoU larger than 0.5) would be close to zero, let alone the subsequent chance of a correct multicategory classification. Thus, although many of the CNNs' metric values were below 0.5, the networks still performed far better than chance.

Different strategies and networks were designed in this research, and their influence on the metrics was tested. Net A was designed to ignore severity levels for disease detection, which is reasonable for basic clinical applications, because usually we only need to know whether certain disease lesions are present on dental X-rays; further manual examination can then determine the severity of the detected diseases. Nevertheless, we wanted to determine whether teaching the machine more details of the disease, such as the severity levels here, would improve the deep CNN's recognition of disease categories. The comparison between Net A and baseline shows that, when only disease names are wanted (i.e., the severity levels are not needed as network output), the baseline trained with extra severity-level information performed no better than Net A trained with disease names only. It can be inferred that there is no need to annotate objects with attributes beyond what the deep CNN is required to output.

For baseline, we trained a single faster R-CNN for all diseases and levels, whereas in Net B we trained one faster R-CNN for each disease category, and in Net C one faster R-CNN for each severity level. Thus, although Net B and Net C performed slightly better than baseline in certain circumstances, they required three times as many model parameters as baseline. In addition, Net C failed to detect mild decay, which might be because the feature differences between diseases within the same level are more obvious than the feature differences between levels within the same disease. Overall, Net B, with one faster R-CNN trained for each disease category, performed better than baseline and Net C, but incurs more computational and memory overhead than baseline.

Metrics were also compared among diseases and levels, with the intention of identifying the indications of the faster R-CNNs in this research. The results showed that periodontitis was well detected at all severity levels, whereas decay and periapical periodontitis were better predicted as severity increased. Moderate and severe decay and periapical periodontitis lesions are much larger and more visually distinctive than mild ones, and small objects are easily lost after downsampling in the faster R-CNN pipeline. Thus, very small disease lesions may not be good indications for faster R-CNN. On the other hand, the distribution of lesions among disease categories and severity levels was uneven in this study, which could also affect network performance. There was also a trend of poorer performance with fewer training samples, because the networks need a sufficient number of ground truth cases before they can correctly predict disease lesions. However, an imbalanced distribution of lesion counts across diseases and levels is the actual situation in real clinical dental radiographs. Thus, for clinical application, radiographs should be screened from the clinical images to form an evenly distributed training dataset.

The performance of the current networks was poorer than that of Srivastava et al. [29], who reported a precision of 61.5 and a recall of 81.5 on bitewing images. This might be because the ground truth used by Srivastava et al. for the existence of caries was radiographic interpretation by dentists combined with clinical verification, and they did not specify the severity or size of the carious lesions. Although the performance is not sufficient for the networks to be used as stand-alone diagnostic tools, they can prompt the dentist with potential disease lesions, which helps improve the efficiency of clinical work. In future work, the structure of the deep CNNs should be updated and pre- or post-image processing techniques should be applied to further improve disease detection performance on dental periapical radiographs.

Conclusions

The following conclusions can be drawn from the constructed faster R-CNNs:

  1. The faster R-CNNs were able to detect diseases including decay, periapical periodontitis, and periodontitis in dental periapical radiographs.

  2. The network training strategy, disease category, and severity level all have significant influences on the performance of the faster R-CNNs.

  3. It is better to train one faster R-CNN for each disease category than to train a single faster R-CNN for all categories. However, training one faster R-CNN for each severity level is discouraged, because performance tends to suffer.

  4. Decay and periapical periodontitis lesions of higher severity tend to be better predicted than lesions of lower severity.

  5. At the mild and moderate levels, periodontitis was better detected than periapical periodontitis, and periapical periodontitis better than decay, but this order was reversed at the severe level.