1 Introduction

Demand for dental healthcare is increasing. Dental health issues mainly include dental diseases, dental implants, and orthodontics. Although the growing number of patients seeking dental diagnoses has contributed to the rapid development of the dental healthcare market, there is a significant shortage of dentists per million population, which places a substantial burden on dentists. In clinical diagnosis, Cone Beam Computed Tomography (CBCT) is widely used to acquire high-resolution 3D images of teeth, offering accurate representations of dental crowns, roots, and bones. CBCT also offers the advantages of low radiation exposure and short scanning time. On the other hand, the voxel information in CBCT images is highly complex, and extracting vital information requires extensive manual segmentation, which is time-consuming and labor-intensive for clinicians and researchers. Thus, the development of digital dentistry and fully automated tooth segmentation methods is crucial for tooth analysis from 3D CBCT scans.

Computer vision technology has found widespread applications in the field of medical imaging, and it is driving the rapid development of digital dentistry. Automatic tooth segmentation is a primary step in tooth image analysis and has attracted increasing research attention. Existing tooth instance segmentation methods can be categorized into two types: traditional methods and deep learning-based methods. Traditional methods include level set [1, 14, 15, 20], graph cut [18, 21], and template fitting [2, 29]. However, these methods rely on manually designed features, which are highly sensitive to complex dental situations and require tedious manual initialization and correction. They often lead to suboptimal segmentation performance in complicated cases. Deep learning methods, on the other hand, are known for their automatic feature extraction, strong adaptability, and high accuracy, and they have been widely adopted in medical image segmentation.

Deep learning-based tooth instance segmentation methods [7, 8, 9, 10, 11, 12, 19, 22, 31] generally achieve better performance than traditional methods. However, nearly all of them rely on convolutional neural networks (CNNs) to extract features from CBCT images for tooth detection and segmentation, and none of them introduces attention mechanisms. The limited ability of CNNs to capture global image information lowers model performance to some extent, and overcoming this limitation to improve tooth segmentation models is challenging. The widespread application of Transformers [5, 13, 25, 33] indicates that Self-Attention can effectively capture the global information of images, which gives models based on the Self-Attention mechanism certain advantages in image segmentation. In recent years, several outstanding neural networks using Self-Attention mechanisms [4, 6, 16, 32, 34] have emerged in the medical image segmentation field. Therefore, this paper aims to construct a fully automatic tooth instance segmentation method incorporating Self-Attention mechanisms.

Inspired by the above work, we propose a fully automated tooth instance segmentation model that utilizes the Self-Attention mechanism. The model has three stages. First, we use V-Net [24] to extract the tooth ROI. Next, we use a multi-task UNETR++ network [28] to predict the centroids and skeletons of teeth; this step localizes the teeth and represents their shapes. Finally, we segment the teeth within the tooth ROI using another multi-task UNETR++ network. By combining the centroids and skeletons of teeth, we achieve tooth instance segmentation. In our evaluation, the fully automatic method achieved a Dice similarity coefficient of 95.1% and an Average Surface Distance of 0.14 mm in tooth segmentation.

In summary, the main contributions of this study are as follows:

  1. We propose a multi-stage model capable of fully automatic tooth instance segmentation on input 3D CBCT images.

  2. By introducing a Self-Attention mechanism, we improve the segmentation accuracy of our model, which surpasses that of the compared models.

  3. By using multi-task learning, we reduce the error in tooth surface segmentation while maintaining a high level of mask segmentation accuracy.

  4. By comparing our model experimentally with CNN-based models, we demonstrate that introducing a Self-Attention mechanism can improve the performance of tooth segmentation models.

2 Related Work

Tooth Segmentation Based on Deep Learning. Inspired by 3D Mask R-CNN [17], Cui et al. [11] introduced ToothNet, an automatic tooth instance segmentation method for CBCT images. ToothNet employs a 3D Region Proposal Network (RPN) [26] for tooth detection, recognition, and segmentation. Chung et al. [8] proposed the PATRCNN+TSNet method, which addresses metal artifacts in CBCT images using pose-aware techniques. Chen et al. [7] presented 3D FCN+MWT, a method that combines deep learning and traditional methods. They utilized a multi-task 3D fully convolutional network (FCN) to simultaneously predict tooth masks and surfaces, and then employed the marker-controlled watershed transform (MWT) for tooth recognition and segmentation. Wu et al. [31] incorporated a center-sensitive mechanism into their method to guide tooth localization, thus avoiding the computational burden of the numerous anchors generated by an RPN in 3D CBCT images. Additionally, they employed DenseASPP-UNet for tooth segmentation and added a boundary loss to reduce prediction errors on tooth boundaries. Jang et al. [19] proposed PanoramicNet, a novel tooth instance segmentation method. This method first unfolds the 3D tooth image into a 2D panorama by calculating the dental arch curve. Then, it detects the teeth on the 2D panoramic images and completes instance segmentation by combining the 2D and 3D results. To address the diverse and complex tooth morphologies and reduce computational complexity, Cui et al. [12] extended their previous work [11] and introduced the Hierarchical Morphology-Guided Network (HMGNet). HMGNet uses tooth centroids to represent tooth positions and introduces tooth skeletons to depict the morphological structure of each tooth, which can significantly enhance tooth segmentation accuracy in complex cases.

Self-Attention. The Self-Attention mechanism calculates the similarities between different positions in the input sequence, assigns weights to each position, and then uses these weights to compute the output for each position. Specifically, given an input sequence X, it first performs linear transformations to obtain three matrices Q, K, and V. Next, it calculates the similarity matrix \(QK^{T}\) by taking the dot product of each row vector of Q with each row vector of K. These similarities are then scaled and normalized into a probability distribution using the softmax function. Finally, the result is multiplied with matrix V to obtain the Self-Attention representation [30]:

$$\begin{aligned} {\text {Attention}}(Q, K, V) = {\text {softmax}}\left( \frac{Q K^{T}}{\sqrt{d_{k}}}\right) \times V, \end{aligned}$$
(1)

where \(d_k\) is the dimension of the key vectors; scaling by \(\sqrt{d_k}\) stabilizes the learning process. Self-Attention allows the model to capture long-range dependencies and global information from the input sequence, which can lead to improved performance in various tasks. In this paper, we introduce the Self-Attention mechanism into tooth instance segmentation to improve accuracy.
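As an illustration of Eq. 1, the following minimal NumPy sketch computes scaled dot-product Self-Attention for a short random sequence. The function names, the projection matrices `W_q`, `W_k`, `W_v`, and the toy sizes are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product Self-Attention following Eq. 1.

    X: (n, d_model) input sequence.
    W_q, W_k, W_v: (d_model, d_k) linear projections (hypothetical names).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, n), rows sum to 1
    return weights @ V, weights

# Toy example: 5 positions, model width 8, key width 4 (arbitrary sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` is a probability distribution over all positions, which is why every output position can draw on information from the whole sequence.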

3 Method

The overall architecture of our method for fully automatic tooth instance segmentation is shown in Fig. 1 and consists of three stages. In the first stage, V-Net [24] is employed for coarse binary segmentation of teeth to obtain the teeth Region of Interest (ROI). In the second stage, a multi-task UNETR++ is used to extract teeth centroids and skeletons, which provide a rough representation of the morphological structure of teeth. In the third stage, another multi-task UNETR++ performs tooth segmentation under the guidance of the tooth skeleton, simultaneously generating teeth masks and boundaries.

Fig. 1. The overall architecture of our method for fully automatic tooth instance segmentation.

3.1 Multi-task UNETR++

In this paper, the multi-task UNETR++ is employed as the backbone network for tooth instance segmentation. As depicted in Fig. 2, the network follows a hierarchical Encoder-Decoder structure. To process the input 3D image, it is first converted into 3D patches using Patch Embedding [13]. Given an input image \(X \in \mathbb{R}^{H\times W\times D}\), it is partitioned into patches of resolution (\(P_{h}\), \(P_{w}\), \(P_{d}\)), resulting in feature maps of size \(\frac{H}{P_{h}} \times \frac{W}{P_{w}} \times \frac{D}{P_{d}} \times C\), where C is the number of feature channels. Throughout the experiments, a patch size of (4, 4, 4) is used. The designed multi-task UNETR++ introduces an additional decoder to support multiple tasks, such as tooth mask segmentation and tooth boundary estimation. Both decoders use skip connections to obtain feature maps from the encoder at each layer.
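The shape bookkeeping of the patch partition can be sketched as follows. This is an illustrative NumPy version that flattens non-overlapping (4, 4, 4) patches and applies a linear projection; the function name `patchify`, the toy volume size, the channel count `C = 32`, and the random projection `W_embed` are assumptions for the example, and the actual Patch Embedding layer of the network may be implemented differently.

```python
import numpy as np

def patchify(volume, patch=(4, 4, 4)):
    """Split an (H, W, D) volume into non-overlapping, flattened patches."""
    H, W, D = volume.shape
    ph, pw, pd = patch
    assert H % ph == 0 and W % pw == 0 and D % pd == 0, "dims must divide evenly"
    gh, gw, gd = H // ph, W // pw, D // pd
    x = volume.reshape(gh, ph, gw, pw, gd, pd)
    x = x.transpose(0, 2, 4, 1, 3, 5)           # (gh, gw, gd, ph, pw, pd)
    return x.reshape(gh, gw, gd, ph * pw * pd)  # one flattened patch per grid cell

rng = np.random.default_rng(0)
volume = rng.random((32, 32, 16))               # toy H x W x D volume
patches = patchify(volume)                      # (8, 8, 4, 64)

C = 32                                          # hypothetical embedding width
W_embed = rng.normal(size=(patches.shape[-1], C))
features = patches @ W_embed                    # (H/4, W/4, D/4, C) = (8, 8, 4, 32)
```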

Fig. 2. The architecture of Multi-task UNETR++.

The core design of UNETR++ is the Efficient Paired Attention (EPA) block, which effectively learns spatial and channel features through a pair of interdependent branches based on spatial and channel attention [28]. Following Eq. 1, the spatial and channel attention are calculated as:

$$\begin{aligned} \begin{aligned} & A_{s} = Attention(Q_{shared}, K_{spatial}, V_{spatial}) \\ & A_{c} = Attention(Q_{shared}, K_{shared}, V_{channel}) \end{aligned} \end{aligned}$$
(2)

In the spatial attention, \(V_{spatial}(HWD \times C)\) and \(K_{shared}(HWD \times C)\) are linearly projected into low-dimensional matrices \(V_{spatial}(p \times C)\) and \(K_{spatial}(p \times C)\), respectively. To facilitate communication between the branches of spatial and channel attention, the weights of the query and key mapping functions are shared, achieving Paired-Attention. This operation also reduces the total number of network parameters. Finally, the spatial attention map and channel attention map are fused through convolutional operations:

$$\begin{aligned} X = Conv_{1}(Conv_{3}(A_{s} + A_{c})). \end{aligned}$$
(3)

Here, \(Conv_{3}\) denotes a convolutional block with a \(3\times 3\times 3\) kernel, while \(Conv_{1}\) denotes a convolutional block with a \(1\times 1\times 1\) kernel.
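To make the data flow of Eq. 2 concrete, the sketch below gives a simplified, single-head NumPy version of the two attention branches with a shared query and key. It is an assumption-laden illustration, not the UNETR++ implementation: all weight names, the projection matrices `E_k` and `E_v`, the scaling factors, and the toy sizes are hypothetical, and the convolutional fusion of Eq. 3 is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def paired_attention(X, W_q, W_k, W_vs, W_vc, E_k, E_v):
    """Simplified spatial + channel attention with shared Q and K.

    X: (n, C) tokens with n = H*W*D.
    E_k, E_v: (p, n) projections that shorten keys/values to p tokens (p << n).
    """
    n, C = X.shape
    Q, K = X @ W_q, X @ W_k          # shared by both branches
    V_s, V_c = X @ W_vs, X @ W_vc    # branch-specific value projections

    # Spatial branch: each token attends over p projected tokens, not all n.
    K_p, V_p = E_k @ K, E_v @ V_s    # (p, C) each
    A_s = softmax(Q @ K_p.T / np.sqrt(C), axis=-1) @ V_p   # (n, C)

    # Channel branch: a (C, C) map relating feature channels.
    # The sqrt(n) scaling is an assumption of this sketch.
    A_c = V_c @ softmax(Q.T @ K / np.sqrt(n), axis=-1)     # (n, C)

    return A_s + A_c                 # Eq. 3 would apply Conv3 and Conv1 to this sum

rng = np.random.default_rng(0)
n, C, p = 64, 8, 16                  # toy sizes
X = rng.normal(size=(n, C))
W_q, W_k, W_vs, W_vc = (rng.normal(size=(C, C)) for _ in range(4))
E_k, E_v = (rng.normal(size=(p, n)) for _ in range(2))
fused = paired_attention(X, W_q, W_k, W_vs, W_vc, E_k, E_v)
```

In this sketch the spatial attention map has size n × p rather than n × n, which is the source of the efficiency gain when p is much smaller than n.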

3.2 Obtaining ROI of Teeth

The first step for the input 3D CBCT image is to obtain the Region of Interest (ROI) containing teeth. This step reduces the computational workload of the subsequent tooth centroid and skeleton extraction phase, as well as of the segmentation phase. Moreover, it has the potential to improve the overall segmentation accuracy. The specific pipeline of this step is illustrated in Fig. 3. V-Net is used to perform binary segmentation of the image (without distinguishing individual teeth), yielding the tooth foreground region. The tooth ROI is then computed from this foreground region.

Fig. 3. Computing the ROI of teeth from a 3D CBCT image.

To compute the ROI accurately, the loss function used for training in this step is a combination of the Dice loss and the Cross-Entropy loss:

$$\begin{aligned} L_{s1} = L_{seg} = \alpha \cdot L_{dice} + (1-\alpha ) \cdot L_{ce}, \end{aligned}$$
(4)

where

$$\begin{aligned} L_{dice} = 1-\frac{2 \sum _{i=1}^N p_i q_i+\epsilon }{\sum _{i=1}^N p_i^2+\sum _{i=1}^N q_i^2+\epsilon }, \end{aligned}$$
(5)
$$\begin{aligned} L_{ce} = -\frac{1}{N} \sum _{i=1}^N\left( q_i \log \left( p_i\right) +\left( 1-q_i\right) \log \left( 1-p_i\right) \right) . \end{aligned}$$
(6)

Here, \(p_i\) represents the value of the i-th voxel in the predicted result, \(q_i\) represents the value of the i-th voxel in the ground truth label, N is the number of voxels, \(\alpha \) is a weighting coefficient that balances the two loss terms, and \(\epsilon \) is a very small number used to prevent division by zero.
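Eqs. 4 to 6 can be written down directly. The following NumPy sketch is meant only to illustrate the formulas; the value `alpha=0.5`, the value of `eps`, and the clipping constant that keeps the logarithm finite are assumptions, since the paper does not state them.

```python
import numpy as np

def dice_loss(p, q, eps=1e-6):
    """Eq. 5: p are predicted probabilities, q are binary labels."""
    p, q = p.ravel(), q.ravel()
    inter = np.sum(p * q)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p ** 2) + np.sum(q ** 2) + eps)

def ce_loss(p, q, clip=1e-7):
    """Eq. 6: binary Cross-Entropy averaged over all voxels."""
    p = np.clip(p.ravel(), clip, 1.0 - clip)   # avoid log(0); an added safeguard
    q = q.ravel()
    return -np.mean(q * np.log(p) + (1.0 - q) * np.log(1.0 - p))

def seg_loss(p, q, alpha=0.5):
    """Eq. 4: weighted sum of Dice and Cross-Entropy (alpha is hypothetical)."""
    return alpha * dice_loss(p, q) + (1.0 - alpha) * ce_loss(p, q)

rng = np.random.default_rng(0)
q = (rng.random((4, 4, 4)) > 0.5).astype(float)   # toy ground truth
perfect = seg_loss(q, q)                          # close to zero
inverted = seg_loss(1.0 - q, q)                   # large
uncertain = seg_loss(np.full_like(q, 0.5), q)     # in between
```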

3.3 Extraction of Teeth Centroids and Skeletons

The tooth centroid helps determine the tooth’s position and instantiate its label, while the tooth skeleton provides an approximate representation of the tooth’s morphological structure. Together, the centroid and skeleton information provide guidance for tooth instance segmentation. The process of this step is illustrated in Fig. 4.

Fig. 4. Extracting the centroids and skeletons of teeth.

The 3D image is processed by two UNETR++ sub-networks, each containing two decoders. One decoder predicts the binary segmentation map, while the other predicts the 3D offset map. The centroid offset map represents the offset between each voxel and its corresponding tooth centroid, while the skeleton offset is the offset between each voxel and the nearest point on the tooth skeleton. By adding the tooth centroid offset vector to the current foreground voxel coordinates, the tooth centroid density map is obtained. After obtaining the tooth centroid density map, a clustering method [27] is applied to cluster the tooth centroid density map to get tooth instance centroid labels. These labels are then mapped onto the tooth foreground, resulting in instance-level tooth foreground images. Similarly, by using the tooth foreground and skeleton offset vector maps together in the clustering operation, the final instance-level teeth skeleton labels are obtained.
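The voting step that turns the centroid offset map into a density map can be sketched as follows. This is a minimal NumPy illustration with synthetic, exact offsets for two cubic "teeth"; the function name, array layout, and toy data are assumptions, and the clustering method of [27] that follows this step is not reproduced here.

```python
import numpy as np

def centroid_density(foreground, offsets):
    """Each foreground voxel casts one vote at (its coordinates + its offset).

    foreground: (H, W, D) boolean mask.
    offsets: (3, H, W, D) vectors pointing from each voxel to its tooth centroid.
    """
    coords = np.argwhere(foreground)                   # (n, 3)
    votes = coords + offsets[:, foreground].T          # (n, 3) voted positions
    votes = np.rint(votes).astype(int)
    shape = np.array(foreground.shape)
    inside = np.all((votes >= 0) & (votes < shape), axis=1)
    density = np.zeros(foreground.shape, dtype=np.int64)
    np.add.at(density, tuple(votes[inside].T), 1)      # accumulate repeated votes
    return density

# Synthetic example: two 3x3x3 "teeth" with exact offsets to their centroids.
labels = np.zeros((16, 16, 16), dtype=int)
labels[2:5, 2:5, 2:5] = 1
labels[10:13, 10:13, 10:13] = 2
offsets = np.zeros((3,) + labels.shape)
for tooth in (1, 2):
    mask = labels == tooth
    coords = np.argwhere(mask)
    offsets[:, mask] = (coords.mean(axis=0) - coords).T

density = centroid_density(labels > 0, offsets)
```

With perfect offsets, all 27 votes of each tooth land on its centroid voxel. With predicted offsets, the votes form a blurred peak per tooth, which is what the subsequent clustering separates into instances.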

In this step, the loss function considers both the tooth mask segmentation and the tooth centroid or skeleton parts,

$$\begin{aligned} L_{s2} = L_{seg} + L_{cs}, \end{aligned}$$
(7)

where \(L_{seg}\) represents the loss for tooth mask segmentation, which combines Dice loss and Cross-Entropy loss. \(L_{cs}\) represents the loss for tooth centroid or skeleton, using L1 Loss.

Fig. 5. Completing the instance segmentation of teeth.

3.4 Tooth Instance Segmentation

The final step for tooth instance segmentation is illustrated in Fig. 5. After obtaining the tooth ROI and tooth skeleton, each individual tooth can be cropped around its centroid. The cropped tooth image is then concatenated with its skeleton and used as the input to the multi-task UNETR++ model. The model simultaneously predicts tooth masks and tooth boundaries, aiming to maintain accurate tooth segmentation while minimizing errors in tooth surface segmentation.
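The cropping and concatenation described above might look like the following NumPy sketch. The crop size of 16 voxels, the zero padding at volume borders, the channel-first layout, and the toy centroid are assumptions chosen for illustration; the paper does not specify these details.

```python
import numpy as np

def crop_around(volume, center, size):
    """Crop a cube with `size` voxels per axis centred on `center`.

    The volume is zero-padded so that crops near the border keep their size.
    """
    half = size // 2
    padded = np.pad(volume, half, mode="constant")
    c = np.rint(center).astype(int) + half          # centre in padded coordinates
    window = tuple(slice(ci - half, ci - half + size) for ci in c)
    return padded[window]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 32))                    # toy ROI image
skeleton = np.zeros_like(image)
skeleton[3, 16, 26:31] = 1.0                        # toy skeleton of one tooth
centroid = np.array([3.0, 16.0, 30.0])              # toy centroid near the border

img_crop = crop_around(image, centroid, size=16)
skel_crop = crop_around(skeleton, centroid, size=16)
net_input = np.stack([img_crop, skel_crop], axis=0)  # (2, 16, 16, 16): image + skeleton
```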

The loss function for individual tooth segmentation considers both tooth mask segmentation (\(L_{seg}\)) and tooth boundary segmentation (\(L_b\)):

$$\begin{aligned} L_{s3} = \lambda L_{seg} + \mu L_{b}. \end{aligned}$$
(8)

Here, \(L_{seg}\) represents the loss for tooth mask segmentation, which combines Dice loss and Cross-Entropy loss. \(L_b\) represents the loss for tooth boundary segmentation, using L2 Loss. In the experiments, \(\lambda = 0.6\) and \(\mu = 0.1\).

4 Experiments

4.1 Experimental Setup

Dataset. We evaluate the performance of our method on the tooth dataset from [9]. This dataset consists of 100 three-dimensional CBCT images of teeth. After excluding two cases in which the tooth images did not match the corresponding annotation labels, 98 valid cases remained. For all experiments, the dataset was randomly split into 70 cases for training, 8 cases for validation, and 20 cases for testing.
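A reproducible random split with these sizes can be sketched in a few lines of Python. The case identifiers and the seed below are hypothetical; the paper does not report how the split was generated.

```python
import random

def split_cases(case_ids, n_train=70, n_val=8, n_test=20, seed=0):
    """Shuffle the case IDs once and cut them into train/val/test lists."""
    assert len(case_ids) == n_train + n_val + n_test
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)      # fixed seed keeps the split repeatable
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

case_ids = [f"case_{i:03d}" for i in range(98)]   # hypothetical identifiers
train, val, test = split_cases(case_ids)
```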

Data Preprocessing. First, we normalize each CBCT image to the range [0, 1]. The stage-specific preprocessing is as follows. (1) Obtaining tooth ROI: Due to the limitations of GPU memory, the input tooth CBCT images are randomly cropped to a size of \(256\times 256\times 256\). (2) Extracting tooth centroids and skeletons: The tooth centroid is taken as the center of the tooth label. The tooth skeleton is obtained with a distance-transform-based algorithm [23], which iteratively removes voxels from the binary mask until only the skeleton remains. After that, the tooth image, tooth label, and tooth skeleton are randomly cropped to a size of \(128\times 128\times 128\) for training. (3) Tooth instance segmentation: The tooth boundaries are computed by applying the Canny edge detection algorithm [3] to each 2D slice of the 3D CBCT image.
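The normalization and random cropping can be illustrated with the NumPy sketch below. Min-max scaling is assumed here because the paper only states the target range [0, 1]; the function names and the small toy volume (cropped to 32 voxels per axis instead of 256 or 128) are also assumptions.

```python
import numpy as np

def normalize01(volume):
    """Min-max scale intensities to [0, 1] (an assumed normalization scheme)."""
    lo, hi = volume.min(), volume.max()
    if hi == lo:
        return np.zeros(volume.shape, dtype=float)
    return (volume - lo) / (hi - lo)

def random_crop(arrays, size, rng):
    """Cut the same random window out of every array so image and labels stay aligned."""
    shape = arrays[0].shape
    start = [rng.integers(0, s - c + 1) for s, c in zip(shape, size)]
    window = tuple(slice(st, st + c) for st, c in zip(start, size))
    return [a[window] for a in arrays]

rng = np.random.default_rng(0)
raw = rng.integers(-1000, 3000, size=(40, 40, 40))      # toy intensity volume
label = (raw > 1500).astype(np.uint8)                   # toy label volume

image = normalize01(raw)
image_crop, label_crop = random_crop([image, label], (32, 32, 32), rng)
```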

Implementation Details. The experiments were conducted using the PyTorch framework and a GeForce RTX 3090 GPU. The batch sizes for the three stages are 1, 1, and 4, and the initial learning rates are 0.001, 0.001, and 0.0001, respectively. A polynomial learning rate decay strategy is used, which continuously decreases the learning rate during training. The Adam optimizer with a weight decay of 0.0001 is used for optimization. The numbers of iterations for the three stages are set to 30k, 60k, and 50k, respectively.
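A polynomial decay schedule is commonly written as the initial rate multiplied by \((1 - t/T)^{power}\). The Python sketch below uses this common form with the first-stage settings from the text; the exponent `power=0.9` is an assumption, because the paper does not report it.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power.

    `power` is a hypothetical value; the paper does not state the exponent.
    """
    return base_lr * (1.0 - step / max_steps) ** power

# Stage 1 in the text: initial learning rate 0.001 over 30k iterations.
schedule = [poly_lr(0.001, step, 30_000) for step in range(0, 30_001, 5_000)]
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final iteration.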

Evaluation Metrics. In this study, multiple metrics are used to assess the accuracy and surface error of the segmentation model, including Dice similarity coefficient (DSC), Jaccard Index, Average Surface Distance (ASD), Sensitivity (Sen), and Hausdorff distance (HD).
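The three overlap-based metrics have simple closed forms in terms of true positives (TP), false positives (FP), and false negatives (FN): DSC = 2TP / (2TP + FP + FN), Jaccard = TP / (TP + FP + FN), and Sen = TP / (TP + FN). The NumPy sketch below evaluates them on a toy pair of binary masks; it assumes non-empty masks and does not cover the surface-based metrics ASD and HD, which additionally require surface extraction and voxel spacing.

```python
import numpy as np

def overlap_metrics(pred, gt):
    """DSC, Jaccard index, and Sensitivity for two non-empty binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "sen": tp / (tp + fn),
    }

# Toy example: a 4x4x4 cube and the same cube shifted by one voxel along one axis.
gt = np.zeros((8, 8, 8), dtype=bool)
gt[2:6, 2:6, 2:6] = True
pred = np.zeros_like(gt)
pred[2:6, 2:6, 3:7] = True

metrics = overlap_metrics(pred, gt)   # TP = 48, FP = 16, FN = 16
```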

4.2 Experimental Results

Comparison with Other Methods. To validate the segmentation performance of the proposed fully automatic tooth segmentation model based on UNETR++, we conducted comparison experiments with state-of-the-art (SOTA) CNN-based methods for tooth instance segmentation. These methods include MWTNet [7] based on 3D-FCN, ToothNet [11] based on 3D RPN, CGDNet [31], which uses tooth center guidance and DenseASPP-UNet, HMGNet [12] based on tooth center and skeleton guidance, and V-Net.

Table 1. Comparison of segmentation accuracy with other models.

Table 1 shows that our model achieves a Dice Similarity Coefficient (DSC) of 95.1% and a Jaccard index (Jaccard) of 90.8%. These values are higher than those of the compared methods, which indicates that our model is the most accurate for tooth segmentation. Additionally, our model shows the smallest Average Surface Distance (ASD) and Hausdorff Distance (HD), at 0.14 mm and 1.39 mm, respectively. These results indicate that the proposed fully automatic tooth instance segmentation method based on UNETR++ not only exhibits higher overall similarity in tooth segmentation, but also performs better in tooth edge segmentation.

To summarize, the tooth segmentation model based on UNETR++ proposed in this study outperforms the compared tooth instance segmentation models. We attribute this to the introduction of Self-Attention, which allows the model to capture global information from 3D CBCT images. Additionally, tooth centroids and tooth skeletons provide a coarse description of teeth and help improve the accuracy of our model. The high segmentation accuracy achieved for both tooth masks and tooth edges further suggests that multi-task learning is effective.

Ablation Experiment. To better investigate the effectiveness of the proposed UNETR++-based tooth instance segmentation model, we conducted ablation experiments while keeping all other experimental conditions the same. In these experiments, the main backbone networks for all three stages were replaced with V-Net as a baseline, and then UNETR++ was used in the second and third stages and compared with the original model. The results are shown in Table 2.

Table 2. The performance of replacing the UNETR++ with V-Net.

Table 2 shows that using UNETR++ in both the second and third stages led to improvements in the model’s performance. Replacing the network for tooth centroid and skeleton extraction with multi-task UNETR++ in the second stage resulted in a 0.2% increase in DSC and a 0.3 mm reduction in ASD. In the third stage, replacing the network for tooth segmentation with multi-task UNETR++ led to a 1.1% increase in DSC and a 0.8 mm reduction in ASD. Lastly, replacing the backbone networks for both the second and third stages with multi-task UNETR++ resulted in a 1.4% increase in DSC and a 0.8 mm reduction in ASD.

These results indicate that using the UNETR++ network with the Self-Attention mechanism improved tooth segmentation performance compared to the V-Net based CNN tooth segmentation model. This is because in 3D CBCT images, the morphology and positions of teeth are quite complex, and the images themselves contain massive amounts of information. Therefore, obtaining the global information of the images can help improve the model’s performance. The Self-Attention mechanism in UNETR++ allows the model to capture this global information effectively, and the Efficient Paired Attention (EPA) blocks play a key role in this process.

5 Conclusion and Future Work

In this paper, we investigate a tooth segmentation method for 3D CBCT images. We use UNETR++ as the backbone network due to its low parameter count, low computational requirements, and state-of-the-art performance in medical image segmentation. We establish a fully automated tooth instance segmentation approach. It begins by obtaining the tooth’s Region of Interest (ROI) using V-Net. Subsequently, it represents the tooth’s morphological structure coarsely by predicting tooth centroids and tooth skeletons. These centroids and skeletons are then used to guide the tooth instance segmentation process. As a result, the fully automated tooth instance segmentation method developed in this paper outperforms other tooth segmentation methods in terms of both tooth instance segmentation accuracy and tooth boundary segmentation error on CBCT images. This also indicates that using a network with Self-Attention mechanisms can achieve excellent segmentation results in tooth segmentation.

Although our method outperforms the compared methods, some limitations remain: (1) Despite UNETR++ being a lightweight model with fewer parameters and lower computational requirements, its training time is still longer than that of simple CNNs. (2) The model requires multiple steps to provide guidance for the final segmentation, which necessitates training multiple networks independently for each step. Future work can focus on the following directions: (1) Researching even lighter segmentation networks or performing data preprocessing to speed up the training process. (2) Refining the model to combine tasks such as obtaining the tooth ROI, tooth centroids, tooth skeletons, and tooth masks into a single stage, achieving tooth instance segmentation in a one-stage process. By addressing these limitations and exploring new approaches, the tooth segmentation method can be further improved and applied more effectively in clinical settings.