
1 Introduction

Thyroid nodules are a common clinical problem [1], and their incidence has risen rapidly worldwide. Ultrasound imaging is non-invasive, radiation-free, convenient, and inexpensive [2], making it the primary tool for diagnosing thyroid nodule diseases. The diagnosis of thyroid nodules in ultrasound images depends on experienced clinicians [3]. However, the low contrast and low signal-to-noise ratio of ultrasound images hinder clinicians from making effective diagnoses. To address this problem, an increasing number of computer-aided diagnosis (CAD) systems have been developed to assist in the diagnosis of thyroid diseases. In traditional CAD systems, the Region of Interest (ROI) of a nodule is first delineated manually by clinicians, which is time consuming and highly dependent on the clinicians' experience, and the nodule is then segmented based on the ROI. Therefore, automatic detection and segmentation of thyroid nodules is essential for CAD systems. Detection predicts the bounding boxes of nodules, and automatic segmentation is then performed based on these bounding boxes, which can effectively reduce the workload of clinicians.

In recent years, many deep learning methods have been proposed and applied to the detection and segmentation of thyroid nodules in ultrasound images.

Thyroid Nodule Detection Methods. Thyroid nodule detection models for ultrasound images can be divided into two types: two-stage models and one-stage models. In order to obtain higher detection precision, two-stage models are usually applied to the detection of thyroid nodules. Li et al. [4] proposed an improved Faster R-CNN [12] for thyroid papillary carcinoma detection. By using a layer-concatenation strategy, their detector extracts features from the region surrounding the cancer regions, which improves detection performance. Liu et al. [5] replaced the layer-concatenation strategy with a Feature Pyramid Network (FPN) [13] and added it to Faster R-CNN [12] to construct a multi-scale detection network, which can extract the features of nodules at different scales. Abdolali et al. [6] replaced Faster R-CNN [12] with the higher-performing Mask R-CNN [14], using a well-designed loss function and a transfer learning strategy to achieve high accuracy on a small dataset. These two-stage models achieve high precision in thyroid nodule detection, but their detection speed is lower than that of one-stage models. In order to detect thyroid nodules at different scales, Song et al. [7] utilized a multi-scale SSD [15] model with a spatial pyramid module to achieve high detection accuracy. To fully extract multi-scale features from feature maps, Shahroudnejad et al. [8] constructed a one-stage model with FPN for detecting and classifying thyroid nodules, which can extract global and local information from feature maps. The above detection methods fully extract thyroid nodule features at different scales by adding modules that capture multi-scale features, such as connections between low-level and high-level layers and FPN, thereby improving the accuracy of thyroid nodule detection.

Thyroid Nodule Segmentation Methods. Ying et al. [9] proposed a cascaded convolutional neural network that first segments the Region of Interest (RoI) containing thyroid nodules and then uses a VGG network to accurately segment the nodules within the RoI. Wang et al. [10] constructed a cascade segmentation network based on DeepLabv3plus [16]: the rough location of the nodules is obtained first, and the nodules are then segmented accurately based on this rough location, which eliminates the influence of the area around the nodules on the segmentation results and thus yields more accurate segmentations. To reduce the mistaken recognition of non-thyroid regions as nodules, Gong et al. [11] were the first to embed a thyroid-region prior-guided feature module into the nodule segmentation model, which improved the accuracy of nodule localization and enhanced segmentation performance. The above thyroid nodule segmentation methods first remove the influence of irrelevant regions and then perform further segmentation on the Region of Interest (RoI), thus reducing the false recognition of non-nodular regions as nodules.

Although many deep learning methods have been applied to thyroid nodule detection and segmentation, most of them address only one of the two tasks, and only a few can detect and segment thyroid nodules simultaneously. Existing thyroid nodule detection methods achieve high accuracy while maintaining high efficiency, but they still struggle with nodules of extreme sizes, nodules with complex internal texture, and multiple nodules. This leads to missed detection of small nodules, false detection of internal nodules within large nodules, and false detection of nodule-like tissue as nodules. In addition, existing thyroid nodule segmentation methods achieve high accuracy, but many challenges remain before they can become real-time systems.

To address the above problems, we propose a multi-task thyroid nodule detection and segmentation model based on the Trident network [17], called MTN-Net. It embeds a novel semantic segmentation branch for accurate segmentation of thyroid nodules and includes an improved NMS algorithm, called TN-NMS, for combining the thyroid nodule detection results from multiple branches. As a result, MTN-Net achieves significant improvements in detecting thyroid nodules of different sizes and nodules with complex internal texture, and effectively suppresses the false detection of internal nodules within large nodules.

The main contributions of this paper can be summarized as follows:

  • We propose a multi-task network based on the Trident network [17] for the detection and segmentation of thyroid nodules in ultrasound images, which generates scale-specific feature maps through trident blocks [17] with different receptive fields and is therefore effective in detecting thyroid nodules of different sizes.

  • A novel semantic segmentation branch based on FCN [18] is embedded into the detection network to complete the thyroid nodule segmentation task, which enables more complete segmentation of thyroid nodules with complex texture.

  • We propose an improved NMS algorithm called TN-NMS to fuse the detection results from multiple branches, which can successfully suppress the false detection of internal nodules within large nodules.

The rest of this paper is organized as follows: we first describe the details of our proposed model and its feature generation in Sect. 2. We then introduce the experimental setup and results in Sect. 3. Finally, we conclude our work and indicate future directions in Sect. 4.

2 Method

2.1 Overall Architecture

The proposed MTN-Net is a multi-branch, two-stage thyroid nodule detection and segmentation model based on the Trident network [17]. Figure 1 illustrates the overall architecture of MTN-Net. The network is composed of a backbone, an extended Faster R-CNN head, and the TN-NMS algorithm. We adopt ResNet-101 with trident blocks as the backbone, in which the conv4_x stage consists of trident blocks containing three branches. This backbone fully extracts multi-scale features of thyroid nodules in ultrasound images, thereby contributing to the detection of nodules of different sizes. Additionally, we add a novel semantic segmentation branch to the extended Faster R-CNN head to accomplish the thyroid nodule segmentation task. Finally, an improved NMS algorithm called TN-NMS is used to combine the thyroid nodule detection results from the multiple branches.

Ultrasound images of thyroid nodules are input to the backbone to generate feature maps with different receptive fields. They are then fed into the extended Faster R-CNN head to produce the corresponding detection and segmentation results, which are eventually combined by the TN-NMS algorithm to generate the output results.
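To make this data flow concrete, the following is a minimal sketch of how a forward pass could be organized. The function names, the score threshold, and the exact per-branch interface are illustrative assumptions, not details of the actual implementation.

```python
import torch

def mtn_net_forward(image, backbone, head, tn_nms, score_thresh=0.05):
    """Hypothetical forward pass: multi-branch features -> extended head -> TN-NMS fusion."""
    # The trident backbone returns one feature map per branch (one per receptive field).
    branch_features = backbone(image)

    all_boxes, all_scores, all_masks = [], [], []
    for feats in branch_features:
        # Each branch shares the same extended Faster R-CNN head, which outputs
        # classification scores, regressed boxes, and per-RoI segmentation masks.
        boxes, scores, masks = head(feats)
        keep = scores > score_thresh
        all_boxes.append(boxes[keep])
        all_scores.append(scores[keep])
        all_masks.append(masks[keep])

    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    masks = torch.cat(all_masks)

    # TN-NMS combines the detections of the three branches (see Sect. 2.3).
    keep = tn_nms(boxes, scores, iou_thresh=0.5, niou_thresh=0.9)
    return boxes[keep], scores[keep], masks[keep]
```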

Fig. 1. The architecture of the proposed MTN-Net. MTN-Net comprises a backbone (ResNet-101 with trident blocks), an extended Faster R-CNN head, and the TN-NMS algorithm.

2.2 Semantic Segmentation Branch

We use a novel semantic segmentation branch based on FCN [18] to segment thyroid nodules. This branch is embedded into the Faster R-CNN detection head in parallel with the bounding-box classification and regression branches. In addition, we add an RoIAlign [14] layer in the Faster R-CNN head to remove the coarse spatial quantization of RoIPool [19], which improves the accuracy of pixel-level mask prediction. The extended Faster R-CNN head is shown in Fig. 2. Different from the extended Faster R-CNN heads described in [14], our extended head contains a novel semantic segmentation branch that segments thyroid nodules with complex textures more completely. We add four convolution layers before the deconvolution layer of the semantic segmentation branch to fully capture the features within the Region of Interest (RoI), so as to completely segment nodules with complex internal texture. Meanwhile, we add \(L_{mask}\) to the loss function; for predicted boxes that do not contain thyroid nodules, the semantic segmentation branch can suppress some incorrectly detected boxes through \(L_{mask}\).
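The following is a minimal PyTorch sketch of such a mask branch (four 3×3 convolutions followed by a 2×2 deconvolution and a 1×1 convolution). The channel widths, the RoIAlign output size, and the single foreground class are assumptions made for illustration.

```python
import torch.nn as nn

class SemanticSegmentationBranch(nn.Module):
    """Sketch of the mask branch: four 3x3 convs, a 2x2 deconv with stride 2, a 1x1 conv."""

    def __init__(self, in_channels=256, hidden_channels=256, num_classes=1):
        super().__init__()
        convs = []
        for _ in range(4):  # the four convolutions added before the deconvolution
            convs += [nn.Conv2d(in_channels, hidden_channels, 3, padding=1),
                      nn.ReLU(inplace=True)]
            in_channels = hidden_channels
        self.convs = nn.Sequential(*convs)
        # The 2x2 deconvolution with stride 2 doubles the spatial size of the RoI feature.
        self.deconv = nn.ConvTranspose2d(hidden_channels, hidden_channels, 2, stride=2)
        self.relu = nn.ReLU(inplace=True)
        self.predictor = nn.Conv2d(hidden_channels, num_classes, 1)  # per-pixel mask logits

    def forward(self, roi_features):           # e.g. (N, 256, 14, 14) from RoIAlign
        x = self.convs(roi_features)
        x = self.relu(self.deconv(x))           # (N, 256, 28, 28)
        return self.predictor(x)                # (N, num_classes, 28, 28) mask logits
```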

Fig. 2. The architecture of our extended Faster R-CNN head, in which a novel semantic segmentation branch is embedded to complete the segmentation task of thyroid nodules.

Fig. 3. A thyroid nodule with complex internal texture that is correctly detected (yellow) and incorrectly detected (red and green), along with the confidence scores. Since the iou between the red and yellow boxes (0.03) and between the green and yellow boxes (0.11) is much smaller than the threshold of 0.5, the incorrect detections cannot be suppressed by the NMS algorithm (as shown in (c)). In contrast, the niou between the red and yellow boxes (1.0) and between the green and yellow boxes (1.0) exceeds the threshold of 0.9, so the TN-NMS algorithm successfully suppresses these false detections (as shown in (d)) (Color figure online)

2.3 TN-NMS

Algorithm 1. The TN-NMS algorithm (pseudocode).

NMS is utilized to merge the detection results from multiple branches in Trident network [17]. It is described as [20]:

$$\begin{aligned} S_{i}=\left\{ \begin{aligned} S_{i},&\quad iou \left( \mathcal {M}, b_{i}\right) <N_{t} \\ 0,&\quad iou \left( \mathcal {M}, b_{i}\right) \ge N_{t} \end{aligned} \right. \end{aligned}$$
(1)

The input of Eq. 1 consists of a list of detection boxes Boxes with their scores Scores and a threshold \(N_{t}\). \(S_{i}\) is the re-scored score of box \(b_{i}\), \(\mathcal {M}\) is the box with the highest score in Boxes, \(b_{i}\) denotes the currently selected box in Boxes, iou is the intersection area of two boxes divided by their union area, and \(N_{t}\) is a threshold that determines whether the currently selected box \(b_{i}\) should be removed. NMS starts by selecting the bounding box \(\mathcal {M}\) with the highest score in Boxes, computes the iou between \(\mathcal {M}\) and each remaining bounding box \(b_{i}\) in Boxes, and then deletes every bounding box \(b_{i}\) whose iou exceeds the threshold \(N_{t}\), which is usually set to 0.5. However, the area of a falsely detected internal nodule is usually much smaller than that of the large nodule containing it, so the iou of their corresponding bounding boxes is less than 0.5 and the NMS algorithm is unable to suppress these false detections, as shown in Fig. 3(c). Therefore, in order to suppress the bounding boxes of these internal nodules, we propose a new overlap measure for thyroid nodule detection, named niou, which is the intersection area of \(b_{i}\) and \(\mathcal {M}\) divided by the area of \(b_{i}\). It is defined as:

$$\begin{aligned} niou \left( \mathcal {M}, b_{i}\right) =\frac{\mathcal {M} \cap b_{i}}{b_{i}} \end{aligned}$$
(2)

The niou between the bounding box of an incorrectly detected internal nodule and that of the correctly detected nodule is usually equal or close to 1.0, so incorrect detections whose niou exceeds the threshold of 0.9 are successfully suppressed, as shown in Fig. 3(d). We therefore add niou to the NMS algorithm and propose an improved NMS algorithm, named TN-NMS, which is used to combine the detection results of the three branches and is described as:

$$\begin{aligned} S_{i}=\left\{ \begin{array}{cc} S_{i}, &{} \quad iou \left( \mathcal {M}, b_{i}\right)<N_{t_{1}} \text{ and } niou \left( \mathcal {M}, b_{i}\right) <N_{t_{2}} \\ 0, &{} \quad iou \left( \mathcal {M}, b_{i}\right) \ge N_{t_{1}} \text{ or } niou \left( \mathcal {M}, b_{i}\right) \ge N_{t_{2}} \end{array}\right. \end{aligned}$$
(3)

where \(N_{t_{1}}\) and \(N_{t_{2}}\) are thresholds that determine whether the currently selected bounding box \(b_{i}\) should be removed from Boxes. The detailed process of TN-NMS is shown in Algorithm 1. In each step of TN-NMS, the scores of all detection boxes that overlap with \(\mathcal {M}\) are updated, then the detection boxes with a score of 0 are removed from Boxes, hence the computational complexity of each step of TN-NMS is \(\mathcal {O}(N)\), where N is the number of detection boxes in Boxes. Therefore, for N detection boxes in Boxes, the computational complexity of the TN-NMS algorithm is \(\mathcal {O}(N^2)\), which is the same as that of the NMS algorithm.
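For clarity, a minimal NumPy sketch of TN-NMS is given below. It follows Eqs. 2 and 3 with hard suppression (the score is set to 0 and the box is removed); the (x1, y1, x2, y2) box format and the helper functions are assumptions made for illustration.

```python
import numpy as np

def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def tn_nms(boxes, scores, iou_thresh=0.5, niou_thresh=0.9):
    """Sketch of TN-NMS (Eq. 3): suppress b_i if iou(M, b_i) >= N_t1 or niou(M, b_i) >= N_t2."""
    boxes = np.asarray(boxes, dtype=np.float64)
    order = np.argsort(scores)[::-1]          # process boxes in descending score order
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep = []
    for idx in order:
        if suppressed[idx]:
            continue
        keep.append(int(idx))                 # idx is the current highest-scoring box M
        m = boxes[idx]
        for j in order:
            if suppressed[j] or j == idx:
                continue
            inter = intersection(m, boxes[j])
            union = box_area(m) + box_area(boxes[j]) - inter
            iou = inter / union if union > 0 else 0.0
            niou = inter / box_area(boxes[j]) if box_area(boxes[j]) > 0 else 0.0  # Eq. 2
            if iou >= iou_thresh or niou >= niou_thresh:
                suppressed[j] = True          # equivalent to setting S_j = 0 in Eq. 3
    return keep                               # indices of the kept detection boxes
```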

2.4 Loss Function

As shown in Fig. 1, the proposed network is a multi-task network, whose loss function combines the loss of classification, bounding box regression and segmentation. In order to improve performance, we add weighting factors to the loss function of each task. Therefore, the total loss function on each Region of Interest(RoI) is defined as follows:

$$\begin{aligned} L_{\text {total }}=\lambda _{c l s} * L_{\text {cls }}+\lambda _{\text {box }} * L_{\text {box }}+\lambda _{\text {mask }} * L_{\text {mask }} \end{aligned}$$
(4)

where \(L_{cls}\), \(L_{box}\), and \(L_{mask}\) denote the classification loss, bounding box regression loss, and mask segmentation loss, respectively, and \(\lambda _{cls}\), \(\lambda _{box}\), \(\lambda _{mask}\) are the weighting factors of each component. We use the cross entropy loss to compute the classification loss of thyroid nodules and the smooth L1 loss for bounding box regression; both are defined in the same way as in [19]. In addition, we adopt the binary cross entropy loss to compute the mask segmentation loss, which is defined on the foreground proposals as follows:

$$\begin{aligned} L_{\text {mask }}=-\frac{1}{n^{2}} \sum _{0 \le i, j \le n} B C E\left( y_{i j}, y_{i j}^{*}\right) \end{aligned}$$
(5)

where n is the side length (height and width) of each mask, \(y_{ij}\) is the predicted value, and \(y_{ij}^{*}\) is the ground truth of each class. The weighting factors help balance the classification, detection and segmentation tasks.
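A minimal sketch of how the total loss in Eq. 4 could be assembled is shown below, using the weights reported later in Sect. 3.2; the tensor shapes and the simple mean reductions are assumptions.

```python
import torch.nn.functional as F

def mtn_total_loss(cls_logits, cls_targets, box_preds, box_targets,
                   mask_logits, mask_targets,
                   lambda_cls=2.0, lambda_box=5.0, lambda_mask=2.0):
    """Sketch of Eq. 4: weighted sum of classification, box regression, and mask losses."""
    # Classification: cross entropy over the RoI class predictions.
    l_cls = F.cross_entropy(cls_logits, cls_targets)
    # Bounding box regression: smooth L1 loss, as in Fast/Faster R-CNN [19].
    l_box = F.smooth_l1_loss(box_preds, box_targets)
    # Mask: per-pixel binary cross entropy on foreground proposals; averaging over the
    # n x n mask corresponds to the 1/n^2 factor in Eq. 5.
    l_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    return lambda_cls * l_cls + lambda_box * l_box + lambda_mask * l_mask
```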

3 Experiments

3.1 Dataset and Preprocessing

We evaluate the proposed architecture on the public thyroid nodule region segmentation dataset TN3K provided in [11], which contains 3493 ultrasound images obtained from 2421 patients. In addition, we compare the performance of our proposed method with state-of-the-art methods on the public DDTI dataset [21], which contains 347 thyroid ultrasound images from 299 patients with thyroid disease, annotated with thyroid nodule segmentations by radiologists. All cases in the DDTI dataset are from the IDIME Ultrasound Department, one of the largest imaging centers in Colombia.

In order to adapt these two datasets to thyroid nodule detection and segmentation, we add bounding box annotations for object detection. In addition, we apply adaptive histogram equalization to each image to transform its gray levels and improve contrast. We also perform data augmentation on the preprocessed training images, including random mirror flipping, random left-right flipping, random cropping, random sharpening, and random increase or decrease of image contrast.
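As an example of the contrast enhancement step, the snippet below applies OpenCV's CLAHE (contrast-limited adaptive histogram equalization) to a grayscale ultrasound image; the clip limit and tile size are illustrative values, not the settings used in our experiments.

```python
import cv2

def enhance_contrast(image_path):
    """Adaptive histogram equalization to improve ultrasound image contrast."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # example parameters
    return clahe.apply(gray)
```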

3.2 Implementation Details

The proposed network is implemented in PyTorch 1.8.1. Our code is built on Detectron2 [22], and many of its default configuration parameters are used for model training and inference. The model is trained on two NVIDIA Tesla P100 GPUs with a batch size of 16, and the backbone of the network is pre-trained on MS-COCO [23]. In our experiments, \(N_{t_{1}}\) and \(N_{t_{2}}\) in TN-NMS are set to 0.5 and 0.9, respectively, and \(\lambda _{cls}\), \(\lambda _{box}\), \(\lambda _{mask}\) in the loss function are set to 2, 5 and 2, respectively. Moreover, the model is trained for 50 epochs with the stochastic gradient descent optimizer and a warmup plus cosine-annealing learning rate schedule: the learning rate increases linearly to 0.05 over the first 1000 iterations and then decreases gradually following cosine annealing. The total training time is 20 h, and the inference time per image is 0.85 s.
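The learning rate schedule described above (linear warmup to 0.05 over the first 1000 iterations, followed by cosine annealing) can be sketched as follows; the total iteration count, momentum, and weight decay are assumed values.

```python
import math
import torch

def build_optimizer_and_scheduler(model, base_lr=0.05, warmup_iters=1000, total_iters=50000):
    """Sketch of SGD with linear warmup followed by cosine annealing (stepped per iteration)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)  # assumed hyperparameters

    def lr_factor(it):
        if it < warmup_iters:
            return (it + 1) / warmup_iters                        # linear warmup
        progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
        return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine annealing

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
    return optimizer, scheduler
```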

3.3 Evaluation Metrics

For the evaluation, in order to accurately quantify the performance of our model, we use the standard COCO metrics, including AP (Average Precision) and \(AP_{50}\), together with the metrics for objects of different sizes: \(AP_{S}\) (smaller than \(32\times 32\) pixels), \(AP_{M}\) (from \(32\times 32\) to \(96\times 96\) pixels), and \(AP_{L}\) (larger than \(96\times 96\) pixels). Since the smallest thyroid nodules in the DDTI dataset are larger than \(32\times 32\) pixels, \(AP_{S}\) cannot be used as an evaluation metric for DDTI. Therefore, we report AP, \(AP_{50}\), \(AP_{M}\), and \(AP_{L}\) for thyroid nodule detection and segmentation on the DDTI dataset.
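These metrics can be computed with the pycocotools evaluation API once the ground truth and predictions are exported in COCO JSON format; the file names below are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("tn3k_test_annotations.json")          # placeholder ground-truth file
coco_dt = coco_gt.loadRes("mtn_net_detections.json")  # placeholder detection results

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")  # use iouType="segm" for mask AP
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # reports AP, AP50, AP_S, AP_M, AP_L with the standard area ranges
```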

3.4 Ablation Study

In order to validate the performance of our proposed architecture, the detection and segmentation metrics are used to compare our proposed model with the baseline model. The baseline is the Trident network with the mask prediction branch proposed in [14], which includes a \(2\times 2\) deconvolution layer with stride 2 and a \(1\times 1\) convolution layer for predicting the mask. Baseline/ResNet-101 backbone refers to the baseline network with ResNet-101 as the backbone. We then add the semantic segmentation branch and the TN-NMS algorithm to the baseline separately, denoted as bNet+S and bNet+T, respectively.

Table 1. Ablation studies on the detection of thyroid nodules.
Table 2. Ablation studies on the segmentation of thyroid nodules.

As shown in Table 2, bNet+S improves \(AP_{L}\) for nodule segmentation by 1.2% and 1.0% on TN3K and DDTI, respectively, which indicates that the semantic segmentation branch performs well in segmenting large nodules. From Table 1, we can see that bNet+T improves \(AP_{S}\) and \(AP_{L}\) by 3.9% and 0.4% on TN3K and \(AP_{L}\) by 0.2% on DDTI, which demonstrates that the TN-NMS algorithm improves the detection of both large and small nodules by suppressing internal nodules within large nodules. When both are added to the baseline, MTN-Net improves substantially on all evaluation metrics compared with the baseline. However, the \(AP_{S}\) of MTN-Net is lower than that of bNet+T. We believe the semantic segmentation branch focuses too much on large nodules and therefore performs worse when detecting and segmenting small nodules, which leads to the lower \(AP_{S}\) of MTN-Net compared with bNet+T.

3.5 Comparisons Against State-of-the-Arts Methods

We compare our framework MTN-Net with several state-of-the-art approaches, including Mask R-CNN [14], Cascade Mask R-CNN [24], Mask Scoring R-CNN [25], and PointRend [26]. Mask R-CNN is a commonly used two-stage detection and segmentation model. Cascade Mask R-CNN is a multi-head model based on Cascade R-CNN with higher detection accuracy than Mask R-CNN. Mask Scoring R-CNN adds a mask scoring branch to Mask R-CNN, which improves segmentation accuracy. PointRend is optimized for segmentation at object edges, resulting in better performance on hard-to-segment object boundaries.

Table 3. Performance comparison of thyroid nodule detection on TN3K and DDTI.
Table 4. Performance comparison of thyroid nodule segmentation on TN3K and DDTI.
Fig. 4. Qualitative comparison of our MTN-Net and the SOTA models. Among them, Baseline, our MTN-Net, Mask R-CNN, and Cascade Mask R-CNN (yellow) are implemented based on Detectron2, while Mask Scoring R-CNN and PointRend (green) are implemented based on MMDetection [27] (Color figure online)

Quantitative Analysis on TN3K. Tables 3 and 4 show the quantitative comparison between our MTN-Net and other SOTA models on the public TN3K dataset. MTN-Net greatly improves AP, \(AP_{50}\), \(AP_{M}\), and \(AP_{L}\) over the other SOTA models. However, its performance in detecting and segmenting small nodules (less than \(32\times 32\) pixels) is inferior to Mask Scoring R-CNN and PointRend. Since the appearance and texture of some small nodules are extremely similar to the surrounding tissues, MTN-Net is prone to mis-detecting other tissues and organs as small nodules. Nevertheless, MTN-Net achieves high accuracy on both \(AP_{M}\) and \(AP_{L}\), which indicates its remarkable competitiveness in detecting and segmenting medium and large nodules.

Quantitative Analysis on DDTI. As shown in Tables 3 and 4, MTN-Net exceeds the other SOTA models on all of the above metrics on the DDTI dataset. For thyroid nodule detection, it improves AP, \(AP_{50}\), \(AP_{M}\), and \(AP_{L}\) by 3.8%, 4.4%, 2.4%, and 0.7%, respectively. For thyroid nodule segmentation, the improvements are 2.4%, 2.2%, 3.2%, and 1.2% for AP, \(AP_{50}\), \(AP_{M}\), and \(AP_{L}\), respectively. This demonstrates that MTN-Net performs excellently in both nodule detection and segmentation when the nodule size is larger than \(32\times 32\) pixels.

Qualitative Analysis. Figure 4 illustrates the qualitative comparison between our MTN-Net and other SOTA models. The first column of Fig. 4 shows that MTN-Net can successfully exclude false-positive detections. The second column illustrates that MTN-Net accurately detects and segments multiple thyroid nodules. The third column shows that MTN-Net is significantly competitive in detecting small nodules. Finally, the fourth column indicates that MTN-Net can not only completely segment large nodules with complex texture, but also effectively suppress falsely detected internal nodules.

4 Conclusion

In this paper, we proposed a two-stage network for thyroid nodule detection and segmentation in ultrasound images. Our network is built on the Trident network, which is capable of precisely detecting thyroid nodules of diverse sizes. The semantic segmentation branch added to the network is effective for fully segmenting large nodules with complex textures. In addition, we proposed an improved NMS algorithm to fuse the detection results from multiple branches, which is useful for suppressing falsely detected internal nodules. Consequently, our network achieves remarkable competitiveness in detecting thyroid nodules of diverse sizes, completely segmenting nodules with complex internal texture, and suppressing incorrectly detected internal nodules. Experimental results demonstrate the effectiveness of the proposed method against other state-of-the-art methods. In the future, we will utilize self-supervised methods to further reduce the false positive rate of our model for thyroid nodule detection and segmentation in ultrasound images.