
1 Introduction

Ultrasound (US) is the primary diagnostic tool for both the detection and characterization of thyroid nodules. As part of the clinical workflow in thyroid sonography, thyroid nodules are measured and their sizes are monitored over time, as significant growth can be a sign of thyroid cancer. Hence, finding the Region of Interest (ROI) of each nodule becomes the preliminary step of Computer-Aided Diagnosis (CAD) systems. In traditional CAD systems, the ROIs are manually defined by experts, which is time-consuming and relies heavily on the experience of the radiologists and sonographers. Therefore, automatic thyroid nodule detection, which predicts the bounding boxes of thyroid nodules in ultrasound images, could play a very important role in computer-aided thyroid cancer diagnosis [11, 33].

Thyroid nodule detection in ultrasound images is an important yet challenging task in both the medical image analysis and computer vision fields [4, 18, 26, 29]. In the past decades, many traditional object detection approaches have been proposed [7, 34, 35, 40], such as BING [5], EdgeBox [39] and Selective Search [32]. However, due to the large variations of the targets, there is still significant room for improving traditional object detection approaches in terms of accuracy and robustness. In recent years, object detection has achieved great improvements by introducing machine learning and deep learning techniques. These methods can be mainly categorized into three groups: (i) two-stage models, such as RCNN [10], Fast-RCNN [9], Faster-RCNN [24], SPP-Net [12], R-FCN [6] and Cascaded-RCNN [3]; (ii) one-stage models, such as OverFeat [25], YOLO (v1, v2, v3, v4, v5) [1, 2, 21,22,23], SSD [19] and RetinaNet [16]; (iii) anchor-free models, such as CornerNet [15], CenterNet [8], ExtremeNet [38], RepPoints [37], FoveaBox [14] and FCOS [31]. Two-stage models were originally more accurate but less efficient than one-stage models. However, with the development of new losses, e.g. the focal loss [16], and new training strategies, one-stage models are now able to achieve comparable performance against two-stage models while requiring less time. Anchor-free models rely on object centers or key points, which are relatively less accessible in ultrasound images.

Almost all of the above detection models were originally designed for object detection in natural images, which have different characteristics than ultrasound images. In particular, ultrasound images have variable spatial resolution, heavy speckle noise, and multiple acoustic artifacts, which make the detection task challenging. In addition, thyroid nodules have diverse sizes, shapes and appearances. Sometimes, thyroid nodules are very similar to the surrounding thyroid tissue and are not delineated by clear boundaries (e.g. ill-defined nodules). Some nodules are heterogeneous due to diffuse thyroid disease, which makes them difficult to differentiate from each other and from their backgrounds. Furthermore, the occasional occurrence of multiple thyroid nodules within the same image, as well as large thyroid nodules with complex interior textures that could be mistaken for multiple internal nodules, further increases the difficulty of the detection task. These characteristics lead to high inter-observer variability among human readers and pose analogous challenges for machine learning tools, often resulting in inaccurate or unreliable nodule detection.

To address the above issues, multi-scale features are very important. Therefore, we propose a novel one-stage thyroid nodule detection model, called TUN-Det, whose backbone is built upon the ReSidual U-blocks (RSU) [20], which are able to extract richer multi-scale features from feature maps at different resolutions. In addition, we design a multi-head architecture for both nodule bounding box classification and regression in our TUN-Det to predict more reliable results. Each multi-head module is comprised of three different heads, which are variants of the RSU block arranged in parallel. Each multi-head module produces three separate outputs, which are supervised by independently computed losses during training. At inference, the multi-head outputs are combined to achieve better detection performance; the Weighted Boxes Fusion (WBF) algorithm [28] is introduced to fuse the outputs of each multi-head module. In summary, our contributions are threefold: (i) a novel one-stage thyroid nodule detection network, TUN-Det, built upon the Residual U-blocks [20]; (ii) a novel multi-head architecture for both the bounding box classification and regression heads, in which an ensemble strategy is embedded; (iii) very competitive performance against the state-of-the-art models on our newly built thyroid nodule detection dataset.

Fig. 1. Architecture of the proposed TUN-Det.

2 Proposed Method

2.1 TUN-Det Architecture

The Feature Pyramid Network (FPN) is one of the most popular architectures in object detection, because it is able to efficiently extract high-level and low-level features from deep and shallow layers, respectively. Multi-scale features play very important roles in object detection: high-level features are responsible for predicting classification scores, while low-level features help guarantee bounding box regression accuracy. FPN architectures usually take existing image classification networks, such as VGG [27], ResNet [13] and so on, as their backbones. However, each stage of these backbones is only able to capture single-scale features, because image classification backbones are designed to perceive high-level semantic meaning while paying less attention to low-level or multi-scale features [20]. To capture more multi-scale features at different stages, we build our TUN-Det upon the Residual U-blocks (RSU), which were first proposed in the salient object detection network U\(^2\)-Net [20]. Our proposed TUN-Det is also a one-stage FPN design similar to RetinaNet [16].

Figure 1 illustrates the overall architecture of our newly proposed TUN-Det for thyroid nodule detection. The backbone of our TUN-Det consists of five stages. The first stage is a plain convolution layer with a stride of two, which reduces the feature map resolution. The second to fifth stages are RSU-7, RSU-6, RSU-5 and RSU-4, respectively, with a max-pooling operation between neighboring stages. Compared with plain convolutions, the RSUs are able to capture both local and global information from feature maps of arbitrary resolution [20]. Therefore, richer multi-scale features {\(C_3, C_4, C_5\)} can be extracted by the backbone built upon these blocks to support nodule detection. Then, an FPN [16] is applied on top of the backbone features {\(C_3, C_4, C_5\)} to create multi-scale pyramid features {\(P_3,P_4,P_5,P_6,P_7\)}, which are used for bounding box regression and classification.
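As an illustration only (not the authors' released code), a minimal Keras-style sketch of this five-stage backbone layout could look as follows; the channel widths are assumptions, and `rsu_block` is a hypothetical placeholder for the RSU-L blocks of [20]:

```python
import tensorflow as tf
from tensorflow.keras import layers

def rsu_block(x, height, filters):
    # Placeholder for an RSU-`height` block: a small U-Net-shaped residual
    # block that mixes local and multi-scale features. Approximated here
    # by a single convolution so the sketch runs end to end.
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def tun_det_backbone(input_shape=(512, 512, 3)):
    inp = tf.keras.Input(shape=input_shape)
    # Stage 1: plain convolution with stride 2 to reduce resolution.
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inp)
    # Stages 2-5: RSU blocks separated by 2x2 max-pooling.
    x = rsu_block(x, height=7, filters=64)      # stage 2 (RSU-7)
    x = layers.MaxPool2D(2)(x)
    c3 = rsu_block(x, height=6, filters=128)    # stage 3 (RSU-6) -> C3
    x = layers.MaxPool2D(2)(c3)
    c4 = rsu_block(x, height=5, filters=256)    # stage 4 (RSU-5) -> C4
    x = layers.MaxPool2D(2)(c4)
    c5 = rsu_block(x, height=4, filters=512)    # stage 5 (RSU-4) -> C5
    # {C3, C4, C5} then feed an FPN that builds {P3..P7} as in RetinaNet [16].
    return tf.keras.Model(inp, [c3, c4, c5])
```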

Fig. 2. Multi-head classification and regression module.

2.2 Multi-head Classification and Regression Module

After obtaining the multi-scale pyramid features {\(P_3,P_4,P_5,P_6,P_7\)}, the most important step is regressing the bounding box coordinates and predicting their probabilities of being nodules. These two processes are usually implemented by a regression module \(BBOX_i=R(P_i)\) and a classification module \(CLAS_i=C(P_i)\), respectively. The regression outputs {\(BBOX_3,BBOX_4,\ldots ,BBOX_7\)} and the classification outputs {\(CLAS_3,CLAS_4,\ldots ,CLAS_7\)} from the different feature levels are then fused by non-maximum suppression (NMS) to obtain the final detection results, as sketched below.
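A schematic sketch of this per-level pipeline (module internals omitted; `regress` and `classify` are placeholders for \(R\) and \(C\)):

```python
def detect(pyramid_feats, regress, classify):
    # pyramid_feats: [P3, P4, P5, P6, P7]; regress/classify are shared
    # modules applied independently to each pyramid level.
    bboxes = [regress(p) for p in pyramid_feats]   # BBOX_3 .. BBOX_7
    scores = [classify(p) for p in pyramid_feats]  # CLAS_3 .. CLAS_7
    return bboxes, scores  # fused downstream via NMS (or WBF, Sect. 2.2)
```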

To further reduce the False Positives (FP) and False Negatives (FN) in the detection results, a multi-model ensemble strategy is usually considered. However, this approach is not preferable in real-world applications due to its high computational and time costs. Hence, we design a multi-head (three-head) architecture for both the classification and regression modules to address this issue. Specifically, each classification and regression module consists of three parallel heads, {\(C^{(1)}, C^{(2)}, C^{(3)}\)} and {\(R^{(1)}, R^{(2)}, R^{(3)}\)}, respectively. Given a feature map \(P_i\), three classification outputs, {\(C^{(1)}(P_i)\), \(C^{(2)}(P_i)\), \(C^{(3)}(P_i)\)}, and three regression outputs, {\(R^{(1)}(P_i)\), \(R^{(2)}(P_i)\), \(R^{(3)}(P_i)\)}, are produced. During training, their losses are computed separately and summed to supervise the model. At inference, the Weighted Boxes Fusion (WBF) algorithm [28] is used to fuse the regression and classification outputs of the different heads. This design embeds the ensemble strategy into both the classification and regression modules to improve detection accuracy while avoiding the training of multiple models, which is the standard procedure in common ensemble methods.
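As a sketch of the inference-time fusion (under the assumption that the `ensemble-boxes` package accompanying [28] matches the paper's WBF; the thresholds below are illustrative, not the paper's settings):

```python
import numpy as np
from ensemble_boxes import weighted_boxes_fusion  # pip install ensemble-boxes

def fuse_heads(head_boxes, head_scores, head_labels, img_size=512):
    # head_boxes: list of 3 arrays (one per head), each (N_k, 4) in pixel
    # (x1, y1, x2, y2) order. WBF expects coordinates normalized to [0, 1].
    boxes_list = [b / img_size for b in head_boxes]
    boxes, scores, labels = weighted_boxes_fusion(
        boxes_list, head_scores, head_labels,
        weights=[1, 1, 1],   # equal trust in the three heads
        iou_thr=0.55,        # assumed clustering threshold
        skip_box_thr=0.01)   # drop near-zero-confidence boxes
    return boxes * img_size, scores, labels
```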

In this paper, the architectures of \(R^{(j)}\) and \(C^{(j)}\) are the same except for the last convolution layer (see Fig. 2). To increase the diversity of the predictions, and hence reduce their variance, three variants of RSU-7 (CBAM RSU-7, CoordConv RSU-7 and BiFPN RSU-7) are developed to construct the multi-head modules. The first head is CBAM RSU-7, in which a Convolutional Block Attention Module (CBAM) [36] is added after the standard RSU-7 block to refine features by channel (\(M_c\)) and spatial (\(M_s\)) attention: \(F_c = M_c(F_{in})\otimes F_{in}\) and \(F_s = M_s(F_c)\otimes F_c\). The second head is CoordConv RSU-7, which replaces the plain convolution layers in the original RSU-7 with Coordinate Convolution [17] layers to encode geometric information. CoordConv can be described as \(conv(concat(F_{in}, F_i, F_j))\), where \(F_{in}\in \mathbb {R}^{h \times w \times c}\) is an input feature map, and \(F_i\) and \(F_j\) are extra row and column coordinate channels, respectively. The third head is BiFPN RSU-7, which extends RSU-7 by adding a bi-directional FPN (BiFPN) [30] layer between the encoding and decoding stages to improve the multi-scale feature representation. The BiFPN layer has a \(\cap \)-shaped architecture consisting of bottom-up and top-down pathways, which helps to learn high-level features by fusing them in two directions. Here, we use a four-stage BiFPN layer to limit complexity and reduce the number of trainable parameters.
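For concreteness, a minimal sketch of the CoordConv operation, \(conv(concat(F_{in}, F_i, F_j))\), used in the second head; the filter count and kernel size are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def coord_conv(f_in, filters=256, kernel_size=3):
    b = tf.shape(f_in)[0]
    h, w = f_in.shape[1], f_in.shape[2]  # static spatial dims assumed
    # Row (F_i) and column (F_j) coordinate channels, normalized to [-1, 1].
    i = tf.linspace(-1.0, 1.0, h)
    j = tf.linspace(-1.0, 1.0, w)
    f_i = tf.tile(tf.reshape(i, [1, h, 1, 1]), [b, 1, w, 1])
    f_j = tf.tile(tf.reshape(j, [1, 1, w, 1]), [b, h, 1, 1])
    # conv(concat(F_in, F_i, F_j)) as in [17].
    x = tf.concat([f_in, f_i, f_j], axis=-1)
    return layers.Conv2D(filters, kernel_size, padding="same")(x)
```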

2.3 Supervision

As shown in Fig. 1, our newly proposed TUN-Det has five groups of classification and regression outputs. Therefore, the total loss is the summation over these five groups: \(\mathcal {L} = \sum _{i=1}^{5}\alpha _i\mathcal {L}_i,\) where \(\alpha _i\) is the weight of each group (all \(\alpha _i\) are set to 1.0 here). For every anchor, each group produces three classification outputs \(\{C^{(1)},C^{(2)},C^{(3)}\}\) and three regression outputs \(\{R^{(1)},R^{(2)},R^{(3)}\}\). Therefore, the loss of each group is defined as

$$\begin{aligned} \mathcal {L}_i = {\textstyle \sum }_{j=1}^3{\lambda _i^{C^{(j)}} \mathcal {L}_i^{C^{(j)}}} + {\textstyle \sum }_{j=1}^3{\lambda _i^{R^{(j)}} \mathcal {L}_i^{R^{(j)}}}, \end{aligned}$$
(1)

where \(\mathcal {L}_i^{C^{(j)}}\) and \(\mathcal {L}_i^{R^{(j)}}\) are the losses of the classification and regression outputs, respectively, and \(\lambda _i^{C^{(j)}}\) and \(\lambda _i^{R^{(j)}}\) are the corresponding weights that determine the importance of each output. We set all the \(\lambda \) weights to 1.0 in our experiments. \(\mathcal {L}_i^{C^{(j)}}\) is the focal loss [16] for classification, defined as follows:

$$\begin{aligned}&\mathcal {L}_i^{C^{(j)}} = \text {Focal}(p_t)= \alpha _t(1-p_t)^\gamma \times BCE(p_c,y_c), \nonumber \\&p_t = {\left\{ \begin{array}{ll} p_c &{} \text {if } y_c = 1 \\ 1-p_c &{} \text {otherwise,} \end{array}\right. } \quad \alpha _t = {\left\{ \begin{array}{ll} \alpha &{} \text {if } y_c = 1 \\ 1-\alpha &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(2)

where \(p_c\) and \(y_c\) are the predicted and target classes, respectively, and \(\alpha \) and \(\gamma \) are the focal weighting factor and focusing parameter, set to 0.25 and 2.0, respectively. \(\mathcal {L}_i^{R^{(j)}}\) is the Smooth-L1 loss [9] for regression, defined as:

$$\begin{aligned} \mathcal {L}_i^{R^{(j)}}=\text {Smooth-L1}(p_r,y_r)= {\left\{ \begin{array}{ll} 0.5(\sigma x)^2 &{} \text {if } |x| < \frac{1}{\sigma ^2} \\ |x|-\frac{0.5}{\sigma ^2} &{} \text {otherwise,} \end{array}\right. } \quad x=p_r-y_r, \end{aligned}$$
(3)

where \(p_r\) and \(y_r\) are the predicted and ground truth bounding boxes, respectively, and \(\sigma \) defines where the regression loss changes from L2 to L1; it is set to 3.0 in our experiments.
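A minimal sketch of Eqs. (1)-(3) with the stated settings (\(\alpha =0.25\), \(\gamma =2.0\), \(\sigma =3.0\), all \(\lambda \) weights 1.0); anchor matching and reductions over anchors are omitted for brevity:

```python
import tensorflow as tf

def focal_loss(p_c, y_c, alpha=0.25, gamma=2.0, eps=1e-7):
    # Eq. (2): alpha_t * (1 - p_t)^gamma * BCE(p_c, y_c).
    p_c = tf.clip_by_value(p_c, eps, 1.0 - eps)
    p_t = tf.where(tf.equal(y_c, 1.0), p_c, 1.0 - p_c)
    alpha_t = tf.where(tf.equal(y_c, 1.0), alpha, 1.0 - alpha)
    bce = -(y_c * tf.math.log(p_c) + (1.0 - y_c) * tf.math.log(1.0 - p_c))
    return alpha_t * tf.pow(1.0 - p_t, gamma) * bce

def smooth_l1_loss(p_r, y_r, sigma=3.0):
    # Eq. (3): quadratic below |x| = 1/sigma^2, linear above, x = p_r - y_r.
    x = tf.abs(p_r - y_r)
    return tf.where(x < 1.0 / sigma**2,
                    0.5 * tf.square(sigma * x),
                    x - 0.5 / sigma**2)

def group_loss(cls_outs, reg_outs, y_c, y_r):
    # Eq. (1) with all lambda weights set to 1.0: sum over the three
    # classification and three regression outputs of one group.
    l_c = sum(tf.reduce_sum(focal_loss(c, y_c)) for c in cls_outs)
    l_r = sum(tf.reduce_sum(smooth_l1_loss(r, y_r)) for r in reg_outs)
    return l_c + l_r
```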

3 Experimental Results

3.1 Datasets and Evaluation Metrics

To validate the performance of our newly proposed TUN-Det on the ultrasound thyroid nodule detection task, we built a new thyroid nodule detection dataset. The dataset was retrospectively collected from 700 patients aged 18-82 years who presented at 12 different imaging centers for a thyroid ultrasound examination. Our retrospective study was approved by the health research ethics boards of the participating centers. There are a total of 3941 ultrasound images, extracted from 1924 transverse (TRX) and 2017 sagittal (SAG) scans. These images are split into training (2534), validation (565) and testing (842) subsets with 3554, 981 and 1268 labeled nodule bounding boxes, respectively. No patient appears in more than one of the training, validation and testing sets. All nodule bounding boxes were manually labeled by 5 experienced sonographers (each with \(\ge \)8 years of experience in thyroid sonography) and validated by 3 radiologists. To evaluate the performance of our TUN-Det against other models, Average Precision (AP) [16] is used as the evaluation metric. The validation set is only used to select the model weights during training; all performance evaluations in this paper are conducted on the testing set.

3.2 Implementation Details

Our proposed TUN-Det is implemented in TensorFlow 1.14 with Keras. The input images are resized to \(512\times 512\) and the batch size is set to 1. The model parameters are initialized with Xavier initialization, and the Adam optimizer with default parameters is used for training. Both training and testing are conducted on a 12-core, 24-thread PC with an AMD Ryzen Threadripper 2920X 4.3 GHz CPU (128 GB RAM) and an NVIDIA GTX 1080Ti GPU (11 GB memory). The model converges after 200 epochs, which takes 20 h in total. The average inference time per \(512\times 512\) image is 94 ms.

Table 1. Ablation on different backbone and head configurations. \(AP_{35}\), \(AP_{50}\) and \(AP_{75}\) are the average precision at fixed IoU thresholds of \(35\%\), \(50\%\) and \(75\%\), respectively. AP is the mean of the AP values computed over ten IoU thresholds from \(50\%\) to \(95\%\) in steps of \(5\%\) [\(AP_{50}\), \(AP_{55}\), \(\cdots \), \(AP_{95}\)].

3.3 Ablation Study

To validate the effectiveness of our proposed architecture, ablation studies are conducted on different configurations, and the results are summarized in Table 1. The first two rows compare the original RetinaNet with a RetinaNet-like detection model that uses our newly developed backbone built upon the RSU blocks. As we can see, our adaptation greatly improves the performance over the original RetinaNet. The bottom part of the table presents the ablation studies on different configurations of the classification and regression modules. It can be observed that our multi-head configuration, CoordConv-CBAM-BiFPN, outperforms the other configurations in terms of AP, \(AP_{35}\) and \(AP_{50}\).

3.4 Comparisons Against State-of-the-Arts

Quantitative Comparisons. To evaluate the performance of our newly proposed TUN-Det, we compare our model against six typical state-of-the-art detection models: (i) Faster-RCNN [24] as a two-stage model; (ii) RetinaNet [16], SSD [19], YOLO-v4 [2] and YOLO-v5 [1] as one-stage models; and (iii) FCOS [31] as an anchor-free model. As shown in Table 2, our TUN-Det greatly improves the AP, \(AP_{35}\), \(AP_{50}\) and \(AP_{75}\) over Faster-RCNN, RetinaNet, SSD, YOLO-v4 and FCOS. Compared with YOLO-v5, our TUN-Det achieves better performance in terms of \(AP_{35}\) and \(AP_{50}\). Although our model is inferior in terms of \(AP_{75}\), it performs better in terms of FN (our Average Recall at the \(75\%\) IoU threshold, \(AR_{75}\), is 45.5 vs. 40.3 for YOLO-v5), which is a priority in thyroid nodule detection so as not to miss any nodules. Low recall with high precision is unacceptable, as it would miss many cancers. AP is usually reported to show average performance; in practice, however, a threshold must be chosen to obtain the final detection results in real-world clinical applications. According to the experiments, our model achieves the best performance at clinically relevant IoU thresholds (e.g. \(35\%\), \(50\%\)), which means our model is more applicable to the clinical workflow.

Table 2. Comparisons against the state-of-the-arts.
Fig. 3. Qualitative comparison of ground truth (green) and detection results (red) for different methods. Each column shows the result of one method. (Color figure online)

Qualitative Comparisons. Figure 3 shows the qualitative comparison of our TUN-Det with other SOTA models on sample sagittal scans (first two rows) and transverse scans (last two rows). Each column shows the results of one method; the ground truth is shown in green and the detection results in red. Figure 3 (1st row) shows that TUN-Det correctly detects the challenging case of a non-homogeneous, large, hypo-echoic nodule, while all other methods fail. The 2nd row illustrates that TUN-Det performs well in detecting nodules with ill-defined boundaries, which the other methods miss. The 3rd and 4th rows highlight that our TUN-Det successfully avoids false positive and false negative detections. The last column of Fig. 3 shows that our TUN-Det produces the most accurate nodule detection results.

4 Conclusion and Discussion

This paper proposes a novel detection network, TUN-Det. Its novel backbone, built upon the RSU blocks, greatly improves detection accuracy by extracting richer multi-scale features from feature maps at different resolutions. The newly proposed multi-head architecture for both the classification and regression heads further improves nodule detection performance by fusing the outputs of diversified sub-modules. Experimental results show that our TUN-Det achieves very competitive performance against existing detection models on overall AP and outperforms other models in terms of \(AP_{35}\) and \(AP_{50}\), which indicates its promise for practical applications. We believe that this architecture is also promising for other detection tasks on ultrasound images. In the near future, we will focus on improving the detection consistency between neighboring slices of 2D sweeps and on exploring new representations for describing nodule merging and splitting in 3D space.