Abstract
During ultrasonic scanning processes, real-time lesion detection can assist radiologists in accurate cancer diagnosis. However, this essential task remains challenging and underexplored. General-purpose real-time object detection models can mistakenly report obvious false positives (FPs) when applied to ultrasound videos, potentially misleading junior radiologists. One key issue is their failure to utilize negative symptoms in previous frames, denoted as negative temporal contexts (NTC) [15]. To address this issue, we propose to extract contexts from previous frames, including NTC, with the guidance of inverse optical flow. By aggregating extracted contexts, we endow the model with the ability to suppress FPs by leveraging NTC. We call the resulting model UltraDet. The proposed UltraDet demonstrates significant improvement over previous state-of-the-art methods and achieves real-time inference speed. We release the code, checkpoints, and high-quality labels of the CVA-BUS dataset [9] at https://github.com/HaojunYu1998/UltraDet.
1 Introduction
Ultrasound is a widely-used imaging modality for clinical cancer screening. Deep learning has recently emerged as a promising approach for ultrasound lesion detection. While previous works focused on lesion detection in still images [25] and offline videos [9, 11, 22], this paper explores real-time ultrasound video lesion detection. Real-time lesion prompts can assist radiologists during scanning and thus help improve diagnostic accuracy. This task requires the model to infer faster than 30 frames per second (FPS) [19], with only previous frames available when processing the current frame.
Previous general-purpose detectors [1, 2] report simple and obvious FPs when applied to ultrasound videos, e.g. the red box in Fig. 1(a). These FPs, attributable to non-lesion anatomies, can mislead junior readers. These anatomies appear like lesions in certain frames, but typically show negative symptoms in adjacent frames when scanned from different positions. So experienced radiologists will refer to corresponding regions in previous frames, denoted as temporal contexts (TC), to help restrain FPs. If TC of a lesion-like region exhibit negative symptoms, denoted as negative temporal contexts (NTC), radiologists are less likely to report it as a lesion [15]. Although important, the utilization of NTC remains unexplored. In natural videos, as transitions from non-objects to objects are implausible, previous works [1, 2, 20] only consider inter-object relationships. As shown in Sect. 4.4, the inability to utilize NTC is a key issue leading to the FPs reported by general-purpose detectors.
To address this issue, we propose a novel UltraDet model to leverage NTC. For each Region of Interest (RoI) \(\mathcal {R}\) proposed by a basic detector, we extract temporal contexts from previous frames. To compensate for inter-frame motion, we generate deformed grids by applying inverse optical flow to the original regular RoI grids, illustrated in Fig. 1. Then we extract the RoI features from the deformed grids in previous frames and aggregate them into \(\mathcal {R}\). We call the overall process Negative Temporal Context Aggregation (NTCA). The NTCA module leverages RoI-level NTC which are crucial for radiologists but ignored in previous works, thereby effectively improving the detection performance in a reliable and interpretable way. We plug the NTCA module into a basic real-time detector to form UltraDet. Experiments on CVA-BUS dataset [9] demonstrate that UltraDet, with real-time inference speed, significantly outperforms previous works, reducing about 50% FPs at a recall rate of 0.90.
Our contributions are four-fold. (1) We identify that the failure of general-purpose detectors on ultrasound videos derives from their incapability of utilizing negative temporal contexts. (2) We propose a novel UltraDet model, incorporating an NTCA module that effectively leverages NTC for FP suppression. (3) We conduct extensive experiments to demonstrate the proposed UltraDet significantly outperforms the previous state-of-the-arts. (4) We release high-quality labels of the CVA-BUS dataset [9] to facilitate future research.
2 Related Works
Real-Time Video Object Detection is typically achieved by single-frame detectors, often equipped with temporal information aggregation modules. One-stage detectors [5, 8, 16, 21] use only intra-frame information; DETR-based detectors [20, 26] and Faster R-CNN-based detectors [1, 2, 7, 14, 23, 28] are also widely used in video object detection. These methods aggregate temporal information by mining inter-object relationships without considering NTC.
Ultrasound Lesion Detection [10] can assist radiologists in clinical practice. Previous works have explored lesion detection in still images [25] and offline videos [9, 11, 22], but real-time video lesion detection remains underexplored. In previous works, the YOLO series [17, 24] and knowledge distillation [19] are used to speed up inference. However, these works rely on single-frame detectors or post-processing methods and do not adopt learnable inter-frame aggregation modules. Thus, their performance is far from satisfactory.
Optical Flow [3] is used to guide ultrasound segmentation [12], motion estimation [4] and elastography [13]. For the first time, we use inverse optical flow to guide temporal context information extraction.
3 Method
In real-time video lesion detection, given the current frame \(\mathcal {I}_t\) and a sequence of T previous frames as \(\{\mathcal {I}_{\tau }\}_{\tau =t-T}^{t-1}\), the goal is to detect lesions in \(\mathcal {I}_t\) by exploiting the temporal information in previous frames as illustrated in Fig. 2.
3.1 Basic Real-Time Detector
The basic real-time detector comprises three main components: a lightweight backbone (e.g. ResNet34 [6]), a Region Proposal Network (RPN) [14], and a Temporal Relation head [2]. The backbone extracts the feature map \(\mathcal {F}_{\tau }\) of frame \(\mathcal {I}_{\tau }\). The RPN generates proposals consisting of boxes \(\mathcal {B}_{\tau }\) and proposal features \(\mathcal {Q}_{\tau }\) using RoI Align and average pooling:

\(\mathcal {B}_{\tau } = {\text {RPN}}(\mathcal {F}_{\tau }), \quad \mathcal {Q}_{\tau } = {\text {AvgPool}}({\text {RoIAlign}}(\mathcal {F}_{\tau }, \mathcal {B}_{\tau })),\)
where \(\tau =t-T,\cdots ,t-1,t\). To aggregate temporal information, proposals from all \(T+1\) frames are fed into the Temporal Relation head and updated with inter-lesion information extracted via a relation operation [7]:

\(\mathcal {Q}^{l} = \mathcal {Q}^{l-1} + {\text {Relation}}(\mathcal {Q}^{l-1}, \mathcal {B}),\)
where \(l=1,\cdots , L\) represent layer indices, \(\mathcal {B}\) and \(\mathcal {Q}\) are the concatenation of all \(\mathcal {B}_{\tau }\) and \(\mathcal {Q}_{\tau }\), and \(\mathcal {Q}^{0}=\mathcal {Q}\). We call this basic real-time detector BasicDet. The BasicDet is conceptually similar to RDN [2] but does not incorporate relation distillation since the number of lesions and proposals in this study is much smaller than in natural videos.
3.2 Negative Temporal Context Aggregation
In this section, we present the Negative Temporal Context Aggregation (NTCA) module. We sample \(T_{\text {ctxt}}\) context frames from T previous frames, then extract temporal contexts (TC) from context frames and aggregate them into proposals. We illustrate the NTCA module in Fig. 3 and elaborate on details as follows.
Inverse Optical Flow Align. We propose Inverse Optical Flow Align (IOF Align) to extract TC features. For the current frame \(\mathcal {I}_t\) and a sampled context frame \(\mathcal {I}_{\tau }\) with \(\tau < t\), we extract TC features from the context feature map \(\mathcal {F}_{\tau }\) at the corresponding regions. We use inverse optical flow \(\mathcal {O}_{t\rightarrow \tau }\in \mathbb {R}^{H\times W\times 2}\) to transform the RoIs from frame t to \(\tau \): \(\mathcal {O}_{t\rightarrow \tau } = {\text {FlowNet}}(\mathcal {I}_{t}, \mathcal {I}_{\tau })\), where H, W represent the height and width of feature maps. \({\text {FlowNet}}(\mathcal {I}_{t}, \mathcal {I}_{\tau })\) is a fixed network [3] that predicts optical flow from \(\mathcal {I}_{t}\) to \(\mathcal {I}_{\tau }\). We refer to \(\mathcal {O}_{t\rightarrow \tau }\) as inverse optical flow because it represents the optical flow in inverse chronological order, from t to \(\tau \). We conduct IOF Align and average pooling to extract \(\mathcal {C}_{t,\tau }\):

\(\mathcal {C}_{t,\tau } = {\text {AvgPool}}({\text {IOFAlign}}(\mathcal {F}_{\tau }, \mathcal {B}_{t}, \mathcal {O}_{t\rightarrow \tau })),\)
where \({\text {IOFAlign}}(\mathcal {F}_{\tau }, \mathcal {B}_{t}, \mathcal {O}_{t\rightarrow \tau })\) extracts context features in \(\mathcal {F}_\tau \) from deformed grids generated by applying offsets \(\mathcal {O}_{t\rightarrow \tau }\) to the original regular grids in \(\mathcal {B}_{t}\), as illustrated in Fig. 1(b).
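As a concrete illustration of the mechanism, a minimal NumPy sketch of IOF Align is given below. It uses nearest-neighbour sampling and a 2×2 output grid for brevity; the function name, grid size, and interpolation scheme are simplifying assumptions for illustration, not the exact details of our implementation (which would typically use bilinear interpolation).

```python
import numpy as np

def iof_align(feat_ctxt, box, flow_t_to_tau, out_size=2):
    """Sketch of IOF Align with nearest-neighbour sampling.

    feat_ctxt:     (H, W, C) context feature map F_tau
    box:           (x1, y1, x2, y2) RoI of the current frame, in feature-map coords
    flow_t_to_tau: (H, W, 2) inverse optical flow O_{t->tau}
    """
    H, W, _ = feat_ctxt.shape
    x1, y1, x2, y2 = box
    # regular out_size x out_size grid over the RoI in frame t
    xs = np.linspace(x1, x2, out_size)
    ys = np.linspace(y1, y2, out_size)
    gx, gy = np.meshgrid(xs, ys)
    # look up the flow at each grid point, then offset the point into frame tau
    ix = np.clip(np.round(gx).astype(int), 0, W - 1)
    iy = np.clip(np.round(gy).astype(int), 0, H - 1)
    dx = flow_t_to_tau[iy, ix, 0]
    dy = flow_t_to_tau[iy, ix, 1]
    wx = np.clip(np.round(gx + dx).astype(int), 0, W - 1)
    wy = np.clip(np.round(gy + dy).astype(int), 0, H - 1)
    # sample the context feature map on the deformed grid, then average-pool
    ctxt = feat_ctxt[wy, wx]           # (out_size, out_size, C)
    return ctxt.mean(axis=(0, 1))      # -> C_{t,tau}, shape (C,)

# toy check: with zero flow, IOF Align reduces to a plain RoI Align
feat = np.arange(16, dtype=float).reshape(4, 4, 1)
flow = np.zeros((4, 4, 2))
print(iof_align(feat, (0, 0, 3, 3), flow))  # [7.5]
```

With non-zero flow, the sampled grid points shift to where the RoI content was located in the context frame, which is the motion compensation the text describes.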
Temporal Aggregation. We concatenate \(\mathcal {C}_{t, \tau }\) across all \(T_{\text {ctxt}}\) context frames to form \(\mathcal {C}_{t}\) and enhance proposal features by fusing \(\mathcal {C}_{t}\) into \(\mathcal {Q}_{t}\):

\(\mathcal {Q}_{\text {ctxt},t}^{l} = \mathcal {Q}_{\text {ctxt},t}^{l-1} + {\text {Attention}}(\mathcal {Q}_{\text {ctxt},t}^{l-1}, \mathcal {C}_{t}, \mathcal {C}_{t}),\)
where \(l=1,\cdots ,L\) represent layer indices, \(\mathcal {Q}_{\text {ctxt},t}^{0}=\mathcal {Q}_{t}\), and \({\text {Attention}}(Q,K,V)\) is Multi-head Attention [18]. We refer to the concatenation of all TC-enhanced proposal features in \(T+1\) frames as \(\mathcal {Q}_{\text {ctxt}}\). To extract consistent TC, the context frames of T previous frames are shared with the current frame.
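The fusion step can be illustrated with plain scaled dot-product attention. This single-head NumPy sketch stands in for the multi-head attention [18] used in the model; the dimensions and the single-head simplification are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (single head, for illustration)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q_t = rng.normal(size=(16, 64))   # proposal features of the current frame
C_t = rng.normal(size=(32, 64))   # contexts from T_ctxt=2 frames x 16 RoIs
# one fusion layer: proposals attend to temporal contexts, residual update
Q_ctxt = Q_t + attention(Q_t, C_t, C_t)
print(Q_ctxt.shape)  # (16, 64)
```

The key design point is that \(\mathcal{Q}_t\) acts as the query while \(\mathcal{C}_t\) supplies both keys and values, so each proposal pulls in evidence — including negative symptoms — from its own region in previous frames.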
3.3 UltraDet for Real-Time Lesion Detection
We integrate the NTCA module into the BasicDet introduced in Sect. 3.1 to form the UltraDet model, which is illustrated in Fig. 2. The head of UltraDet consists of stacked NTCA and relation modules.
During training, we apply regression and classification losses \(\mathcal {L}=\mathcal {L}_{\text {reg}} + \mathcal {L}_{\text {cls}}\) to the current frame. To improve training efficiency, we apply auxiliary losses \(\mathcal {L}_{\text {aux}}=\mathcal {L}\) to all previous T frames. During inference, the UltraDet model uses the current frame and T previous frames as inputs and generates predictions only for the current frame. This design endows the UltraDet with the ability to perform real-time lesion detection.
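The loss composition amounts to summing the per-frame loss over the current frame and the T previous frames. A minimal sketch, with an illustrative stand-in for the per-frame detection loss (the names and the toy loss are not from our implementation):

```python
def detection_loss(pred, target):
    """Illustrative stand-in for the per-frame loss L_reg + L_cls."""
    return abs(pred - target)

def training_loss(frame_preds, frame_targets):
    """Total training loss: loss on the current frame (last index)
    plus auxiliary losses on the T previous frames."""
    losses = [detection_loss(p, t) for p, t in zip(frame_preds, frame_targets)]
    return losses[-1] + sum(losses[:-1])

# three frames: two previous + the current one
print(training_loss([0.5, 1.0, 2.0], [0.0, 1.0, 1.5]))  # 1.0
```

Supervising the previous frames as well gives the relation and NTCA layers meaningful proposal features to aggregate from the start of training.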
4 Experiments
4.1 Dataset
CVA-BUS Dataset. We use the open-source CVA-BUS dataset proposed in CVA-Net [9], which consists of 186 valid videos. We split the dataset into train-val (154 videos) and test (32 videos) sets. The train-val split contains 21423 frames with 170 lesions; the test split contains 3849 frames with 32 lesions. We focus on the lesion detection task and do not utilize the benign/malignant classification labels provided in the original dataset.
High-Quality Labels. The bounding box labels provided in the original CVA-BUS dataset are unsteady and sometimes inaccurate, leading to jiggling and inaccurate model predictions. We provide a new version of high-quality labels that are re-annotated by experienced radiologists. We reproduce all baselines using our high-quality labels to ensure a fair comparison. Visual comparisons of two versions of labels are available in supplementary materials. To facilitate future research, we will release these high-quality labels.
4.2 Evaluation Metrics
Pr80, Pr90. In clinical applications, it is important for detection models to be sensitive. So we provide frame-level precision values with high recall rates of 0.80 and 0.90, which we denote as Pr80 and Pr90, respectively.
FP80, FP90. We further report lesion-level FP rates as critical metrics. Frame-level FPs are linked by IoU scores to form FP sequences [24]. The number of FP sequences per minute at recall rates of 0.80 and 0.90 are reported as FP80 and FP90, respectively. The unit of lesion-level FP rates is seq/min.
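The sequence-level counting can be sketched as a greedy IoU-based linking procedure. The threshold and the exact linking rule below are illustrative assumptions, not the precise protocol of [24].

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def count_fp_sequences(fp_per_frame, iou_thr=0.5):
    """Greedily link frame-level FP boxes across consecutive frames.

    fp_per_frame: list over frames, each a list of FP boxes.
    A box that overlaps (IoU >= iou_thr) a box kept from the previous
    frame continues that sequence; otherwise it starts a new one.
    """
    sequences = 0
    active = []                      # boxes of sequences alive in the previous frame
    for boxes in fp_per_frame:
        nxt = []
        for b in boxes:
            if not any(iou(b, a) >= iou_thr for a in active):
                sequences += 1       # no match: a new FP sequence begins
            nxt.append(b)
        active = nxt
    return sequences

frames = [[(0, 0, 10, 10)], [(1, 1, 11, 11)], [], [(50, 50, 60, 60)]]
print(count_fp_sequences(frames))  # 2
```

Dividing the sequence count by the video duration in minutes then yields the seq/min rate reported as FP80/FP90.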
AP50. We provide AP50 instead of mAP or AP75 because the IoU threshold of 0.50 is sufficient for lesion localization in clinical practice. Higher thresholds like 0.75 or 0.90 are impractical due to the presence of blurred lesion edges.
R@16. To evaluate the highest achievable sensitivity, we report the frame-level average recall rates of Top-16 proposals, denoted as R@16.
4.3 Implementation Details
UltraDet Settings. We use FlowNetS [3] as the fixed FlowNet in IOF Align and share the finding of previous works [4, 12, 13] that a FlowNet trained on natural datasets generalizes well to ultrasound datasets. We set the pooling stride in the FlowNet to 4, the number of UltraDet head layers \(L=2\), the number of previous frames \(T=15\) and \(T_{\text {ctxt}}=2\), and the number of proposals to 16. We cache intermediate results of previous frames and reuse them to speed up inference. Other hyper-parameters are listed in supplementary materials.
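The caching of intermediate results can be sketched as a fixed-length buffer that retains only the last T frames' outputs, so each inference step computes features for the current frame alone. The class and names here are illustrative, not from our code.

```python
from collections import deque

class FeatureCache:
    """Sketch of inference-time reuse: keep the last T frames' intermediate
    results (e.g. backbone features and proposals) instead of recomputing."""

    def __init__(self, T=15):
        self.buf = deque(maxlen=T)   # oldest entries are dropped automatically

    def push(self, frame_result):
        self.buf.append(frame_result)

    def previous(self):
        return list(self.buf)

cache = FeatureCache(T=3)
for t in range(5):
    cache.push(f"feat_{t}")
print(cache.previous())  # ['feat_2', 'feat_3', 'feat_4']
```

Because only the current frame's backbone and head computations run per step, the per-frame cost stays roughly constant regardless of T.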
Shared Settings. All models are built in PyTorch framework and trained using eight NVIDIA GeForce RTX 3090 GPUs. We use ResNet34 [6] as backbones and set the number of training iterations to 10,000. We set the feature dimensions of detection heads to 256 and baselines are re-implemented to utilize only previous frames. We refer to our code for more details.
4.4 Main Results
Quantitative Results. We compare the performance of real-time detectors with that of UltraDet in Table 1. We perform 4-fold cross-validation and report mean values and standard errors on the test set to mitigate fluctuations. UltraDet outperforms all previous state-of-the-art methods in terms of precision and FP rates. Notably, the Pr90 of UltraDet reaches 90.8%, a 5.4% absolute improvement over the best competitor, PTSEFormer [20]. Moreover, the FP90 of UltraDet is 5.7 seq/min, about 50% fewer FPs than PTSEFormer. Although CVA-Net [9] achieves a comparable AP50, our method significantly improves precision and FP rates over CVA-Net [9].
Importance of NTC. In Fig. 4(a), we illustrate the proportion of FPs that can be suppressed by using NTC. Whether an FP can be suppressed by NTC is determined by the manual judgment of experienced radiologists. We find that about 50%–70% of the FPs of previous methods are suppressible. By utilizing NTC, our UltraDet effectively prevents this type of FP.
Inference Speed. We run inference using one NVIDIA GeForce RTX 3090 GPU and report the inference speed in Table 1. The UltraDet achieves an inference speed of 30.4 FPS and already meets the 30 FPS requirement. Using TensorRT, we further optimize the speed to 35.2 FPS, which is sufficient for clinical applications [19].
Qualitative Results. Figure 4(b) visually compares BasicDet and UltraDet. The BasicDet reports FPs at \(t=30\) and 40 as it fails to leverage NTC when \(t=20\), while the UltraDet successfully suppresses FPs with the NTCA module.
4.5 Ablation Study
Effectiveness of Each Sub-module. We ablate the effectiveness of each sub-module of the NTCA module in Table 2. Specifically, we replace the IOF Align with an RoI Align and the Temporal Aggregation with a simple average pooling in the temporal dimension. The results demonstrate that both IOF Align and Temporal Aggregation are crucial, as removing either of them leads to a noticeable drop in performance.
Design of the NTCA Module. Besides the RoI-level TC aggregation in UltraDet, feature-level aggregation is also feasible. We plug the optical flow feature warping proposed in FGFA [28] into the BasicDet and report the results in Table 3. We find that RoI-level aggregation is more effective than feature-level aggregation, and that aggregating at both levels provides no additional gains. This finding agrees with radiologists' practice of focusing on local regions rather than global information.
5 Conclusion
In this paper, we address the clinical challenge of real-time ultrasound lesion detection. We propose a novel Negative Temporal Context Aggregation (NTCA) module that imitates radiologists' diagnostic process to suppress FPs. The NTCA module leverages negative temporal contexts, which are essential for FP suppression but ignored in previous works. We plug the NTCA module into a BasicDet to form the UltraDet model, which significantly improves precision and FP rates over previous state-of-the-art methods while achieving real-time inference speed. UltraDet has the potential to serve as a real-time lesion detection tool and assist radiologists in more accurate cancer diagnosis in clinical practice.
References
Chen, Y., Cao, Y., Hu, H., Wang, L.: Memory enhanced global-local aggregation for video object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10337–10346 (2020)
Deng, J., Pan, Y., Yao, T., Zhou, W., Li, H., Mei, T.: Relation distillation networks for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7023–7032 (2019)
Dosovitskiy, A., et al.: FlowNet: learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766 (2015)
Evain, E., Faraz, K., Grenier, T., Garcia, D., De Craene, M., Bernard, O.: A pilot study on convolutional neural networks for motion estimation from ultrasound images. IEEE Trans. Ultrason. Ferroelectr. Freq. Control 67(12), 2565–2573 (2020)
Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597 (2018)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Lin, Z., Lin, J., Zhu, L., Fu, H., Qin, J., Wang, L.: A new dataset and a baseline model for breast lesion detection in ultrasound videos. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) MICCAI 2022, Part III. LNCS, vol. 13433, pp. 614–623. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-16437-8_59
Liu, S., et al.: Deep learning in medical ultrasound analysis: a review. Engineering 5(2), 261–275 (2019)
Movahedi, M.M., Zamani, A., Parsaei, H., Tavakoli Golpaygani, A., Haghighi Poya, M.R.: Automated analysis of ultrasound videos for detection of breast lesions. Middle East J. Cancer 11(1), 80–90 (2020)
Nguyen, A., et al.: End-to-end real-time catheter segmentation with optical flow-guided warping during endovascular intervention. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9967–9973. IEEE (2020)
Peng, B., Xian, Y., Jiang, J.: A convolution neural network-based speckle tracking method for ultrasound elastography. In: 2018 IEEE International Ultrasonics Symposium (IUS), pp. 206–212. IEEE (2018)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, vol. 28 (2015)
Spak, D.A., Plaxco, J., Santiago, L., Dryden, M., Dogan, B.: BI-RADS® fifth edition: a summary of changes. Diagn. Interv. Imaging 98(3), 179–190 (2017)
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636 (2019)
Tiyarattanachai, T., et al.: The feasibility to use artificial intelligence to aid detecting focal liver lesions in real-time ultrasound: a preliminary study based on videos. Sci. Rep. 12(1), 7749 (2022)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Vaze, S., Xie, W., Namburete, A.I.: Low-memory CNNs enabling real-time ultrasound segmentation towards mobile deployment. IEEE J. Biomed. Health Inform. 24(4), 1059–1069 (2020)
Wang, H., Tang, J., Liu, X., Guan, S., Xie, R., Song, L.: PTSEFormer: progressive temporal-spatial enhanced transformer towards video object detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022, Part VIII. LNCS, vol. 13668, pp. 732–747. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20074-8_42
Wang, J., Song, L., Li, Z., Sun, H., Sun, J., Zheng, N.: End-to-end object detection with fully convolutional network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15849–15858 (2021)
Wang, Y., et al.: Key-frame guided network for thyroid nodule recognition using ultrasound videos. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) MICCAI 2022, Part IV. LNCS, vol. 13434, pp. 238–247. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-16440-8_23
Wu, H., Chen, Y., Wang, N., Zhang, Z.: Sequence level semantics aggregation for video object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9217–9225 (2019)
Wu, X., et al.: CacheTrack-YOLO: real-time detection and tracking for thyroid nodules and surrounding tissues in ultrasound videos. IEEE J. Biomed. Health Inform. 25(10), 3812–3823 (2021)
Yap, M.H., et al.: Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J. Biomed. Health Inform. 22(4), 1218–1226 (2017)
Zhou, Q., et al.: TransVOD: end-to-end video object detection with spatial-temporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 45(6), 7853–7869 (2023)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
Zhu, X., Wang, Y., Dai, J., Yuan, L., Wei, Y.: Flow-guided feature aggregation for video object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417 (2017)
Acknowledgements
This work is supported by National Key R&D Program of China (2022ZD0114900) and National Science Foundation of China (NSFC62276005).
Cite this paper
Yu, H. et al. (2023). Mining Negative Temporal Contexts for False Positive Suppression in Real-Time Ultrasound Lesion Detection. In: Greenspan, H., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, vol 14225. Springer, Cham. https://doi.org/10.1007/978-3-031-43987-2_1