1 Introduction

Ultrasound is a widely used imaging modality for clinical cancer screening. Deep learning has recently emerged as a promising approach for ultrasound lesion detection. While previous works focused on lesion detection in still images [25] and offline videos [9, 11, 22], this paper explores real-time ultrasound video lesion detection. Real-time lesion prompts can assist radiologists during scanning and are therefore more helpful for improving diagnostic accuracy. This task requires the model to infer faster than 30 frames per second (FPS) [19], with only previous frames available when processing the current frame.

Fig. 1. Illustration of the Negative Temporal Context Aggregation (NTCA) module. (a) Our motivation: mining negative temporal contexts for FP suppression. (b) The NTCA module leverages temporal contexts to suppress the FP. (Color figure online)

Previous general-purpose detectors [1, 2] report simple and obvious FPs when applied to ultrasound videos, e.g., the red box in Fig. 1(a). These FPs, attributable to non-lesion anatomies, can mislead junior readers. Such anatomies appear lesion-like in certain frames but typically show negative symptoms in adjacent frames, where they are scanned from different positions. Experienced radiologists therefore refer to the corresponding regions in previous frames, denoted as temporal contexts (TC), to help suppress FPs. If the TC of a lesion-like region exhibit negative symptoms, denoted as negative temporal contexts (NTC), radiologists are less likely to report it as a lesion [15]. Despite their importance, the utilization of NTC remains unexplored. In natural videos, transitions from non-objects to objects are implausible, so previous works [1, 2, 20] consider only inter-object relationships. As shown in Sect. 4.4, the inability to utilize NTC is a key cause of the FPs reported by general-purpose detectors.

To address this issue, we propose a novel UltraDet model to leverage NTC. For each Region of Interest (RoI) \(\mathcal {R}\) proposed by a basic detector, we extract temporal contexts from previous frames. To compensate for inter-frame motion, we generate deformed grids by applying inverse optical flow to the original regular RoI grids, as illustrated in Fig. 1. We then extract RoI features from the deformed grids in previous frames and aggregate them into \(\mathcal {R}\). We call the overall process Negative Temporal Context Aggregation (NTCA). The NTCA module leverages RoI-level NTC, which are crucial for radiologists but ignored in previous works, thereby improving detection performance in a reliable and interpretable way. We plug the NTCA module into a basic real-time detector to form UltraDet. Experiments on the CVA-BUS dataset [9] demonstrate that UltraDet, while maintaining real-time inference speed, significantly outperforms previous works, reducing FPs by about 50% at a recall rate of 0.90.

Our contributions are four-fold. (1) We identify that the failure of general-purpose detectors on ultrasound videos derives from their incapability of utilizing negative temporal contexts. (2) We propose a novel UltraDet model, incorporating an NTCA module that effectively leverages NTC for FP suppression. (3) We conduct extensive experiments demonstrating that the proposed UltraDet significantly outperforms previous state-of-the-art methods. (4) We release high-quality labels for the CVA-BUS dataset [9] to facilitate future research.

2 Related Works

Real-Time Video Object Detection is typically achieved by single-frame detectors, often augmented with temporal information aggregation modules. One-stage detectors [5, 8, 16, 21] use only intra-frame information, while DETR-based detectors [20, 26] and Faster R-CNN-based detectors [1, 2, 7, 14, 23, 28] are also widely used in video object detection; they aggregate temporal information by mining inter-object relationships without considering NTC.

Ultrasound Lesion Detection [10] can assist radiologists in clinical practice. Previous works have explored lesion detection in still images [25] and offline videos [9, 11, 22], but real-time video lesion detection remains underexplored. In previous works, the YOLO series [17, 24] and knowledge distillation [19] are used to speed up inference. However, these works rely on single-frame detectors or post-processing and do not adopt learnable inter-frame aggregation modules, so their performance is far from satisfactory.

Optical Flow [3] is used to guide ultrasound segmentation [12], motion estimation [4] and elastography [13]. For the first time, we use inverse optical flow to guide temporal context information extraction.

3 Method

Fig. 2. Illustration of the UltraDet model. The yellow and green frames are sampled as context frames, and their feature maps are inputs of the NTCA module. (Color figure online)

In real-time video lesion detection, given the current frame \(\mathcal {I}_t\) and a sequence of T previous frames as \(\{\mathcal {I}_{\tau }\}_{\tau =t-T}^{t-1}\), the goal is to detect lesions in \(\mathcal {I}_t\) by exploiting the temporal information in previous frames as illustrated in Fig. 2.

3.1 Basic Real-Time Detector

The basic real-time detector comprises three main components: a lightweight backbone (e.g. ResNet34 [6]), a Region Proposal Network (RPN) [14], and a Temporal Relation head [2]. The backbone extracts the feature map \(\mathcal {F}_{\tau }\) of frame \(\mathcal {I}_{\tau }\). The RPN generates proposals consisting of boxes \(\mathcal {B}_{\tau }\), from which proposal features \(\mathcal {Q}_{\tau }\) are extracted using RoI Align and average pooling:

$$\begin{aligned} \mathcal {Q}_{\tau }={\text {AvgPool}}\left( {\text {RoIAlign}}(\mathcal {F}_{\tau },\mathcal {B}_{\tau })\right) \end{aligned}$$
(1)

where \(\tau =t-T,\cdots ,t-1,t\). To aggregate temporal information, proposals from all \(T+1\) frames are fed into the Temporal Relation head and updated with inter-lesion information extracted via a relation operation [7]:

$$\begin{aligned} \mathcal {Q}^{l}=\mathcal {Q}^{l-1}+{\text {Relation}}(\mathcal {Q}^{l-1}, \mathcal {B}) \end{aligned}$$
(2)

where \(l=1,\cdots , L\) represent layer indices, \(\mathcal {B}\) and \(\mathcal {Q}\) are the concatenation of all \(\mathcal {B}_{\tau }\) and \(\mathcal {Q}_{\tau }\), and \(\mathcal {Q}^{0}=\mathcal {Q}\). We call this basic real-time detector BasicDet. The BasicDet is conceptually similar to RDN [2] but does not incorporate relation distillation since the number of lesions and proposals in this study is much smaller than in natural videos.
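
To make this concrete, below is a minimal PyTorch sketch of Eqs. (1)–(2). It assumes torchvision's RoI Align; the shapes, hyper-parameters, and the omission of the geometric term of the relation operation [7] (which reduces it to plain self-attention) are our simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

def proposal_features(feature_map, boxes, out_size=7, spatial_scale=1.0 / 16):
    """Eq. (1): RoI Align + average pooling.
    feature_map: (1, C, H, W); boxes: (N, 4) as (x1, y1, x2, y2) in image coords."""
    # roi_align expects (N, 5) rois with a leading batch index.
    idx = torch.zeros(len(boxes), 1, dtype=boxes.dtype, device=boxes.device)
    pooled = roi_align(feature_map, torch.cat([idx, boxes], dim=1),
                       out_size, spatial_scale, aligned=True)
    return pooled.mean(dim=(2, 3))  # (N, C) proposal features Q_tau

class RelationLayer(nn.Module):
    """Eq. (2): Q^l = Q^{l-1} + Relation(Q^{l-1}, B). The geometric term of the
    relation operation [7] is omitted here, reducing it to self-attention over
    the proposals of all T+1 frames."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q):  # q: (1, N_total, dim)
        out, _ = self.attn(q, q, q)
        return q + out  # residual update
```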

3.2 Negative Temporal Context Aggregation

In this section, we present the Negative Temporal Context Aggregation (NTCA) module. We sample \(T_{\text {ctxt}}\) context frames from T previous frames, then extract temporal contexts (TC) from context frames and aggregate them into proposals. We illustrate the NTCA module in Fig. 3 and elaborate on details as follows.

Fig. 3. Illustration of the Negative Temporal Context Aggregation (NTCA) module.

Inverse Optical Flow Align. We propose Inverse Optical Flow Align (IOF Align) to extract TC features. For the current frame \(\mathcal {I}_t\) and a sampled context frame \(\mathcal {I}_{\tau }\) with \(\tau < t\), we extract TC features from the context feature map \(\mathcal {F}_{\tau }\) at the corresponding regions. To transform the RoIs from frame t to frame \(\tau \), we use the inverse optical flow \(\mathcal {O}_{t\rightarrow \tau } = {\text {FlowNet}}(\mathcal {I}_{t}, \mathcal {I}_{\tau }) \in \mathbb {R}^{H\times W\times 2}\), where H and W denote the height and width of the feature maps and \({\text {FlowNet}}\) is a fixed network [3] that predicts the optical flow from \(\mathcal {I}_{t}\) to \(\mathcal {I}_{\tau }\). We call \(\mathcal {O}_{t\rightarrow \tau }\) inverse optical flow because it represents the optical flow in inverse chronological order, from t to \(\tau \). We apply IOF Align and average pooling to extract \(\mathcal {C}_{t,\tau }\):

$$\begin{aligned} \mathcal {C}_{t,\tau }={\text {AvgPool}}\left( {\text {IOFAlign}}(\mathcal {F}_{\tau }, \mathcal {B}_{t}, \mathcal {O}_{t\rightarrow \tau })\right) \end{aligned}$$
(3)

where \({\text {IOFAlign}}(\mathcal {F}_{\tau }, \mathcal {B}_{t}, \mathcal {O}_{t\rightarrow \tau })\) extracts context features in \(\mathcal {F}_\tau \) from deformed grids generated by applying the offsets \(\mathcal {O}_{t\rightarrow \tau }\) to the original regular grids in \(\mathcal {B}_{t}\), as illustrated in Fig. 1(b).
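
A single-box sketch of IOF Align may help; it assumes the flow field stores per-pixel (dx, dy) offsets in feature-map pixels, a convention not stated in the paper, and illustrates Eq. (3) rather than reproducing the authors' implementation.

```python
import torch
import torch.nn.functional as F

def iof_align(feat_ctx, box, flow, out_size=7):
    """Single-box sketch of IOF Align (Eq. 3); not the authors' implementation.
    feat_ctx: (1, C, H, W) context feature map F_tau.
    box: (4,) tensor (x1, y1, x2, y2) in feature-map coordinates.
    flow: (1, 2, H, W) inverse optical flow O_{t->tau}, channels assumed
          to be (dx, dy) in feature-map pixels."""
    x1, y1, x2, y2 = box.tolist()
    _, _, H, W = feat_ctx.shape

    # Regular out_size x out_size RoI grid in the current frame t.
    ys = torch.linspace(y1, y2, out_size)
    xs = torch.linspace(x1, x2, out_size)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([gx, gy], dim=-1).unsqueeze(0)  # (1, S, S, 2) as (x, y)

    def normalize(p):  # pixel coordinates -> [-1, 1] for grid_sample
        return torch.stack([2 * p[..., 0] / (W - 1) - 1,
                            2 * p[..., 1] / (H - 1) - 1], dim=-1)

    # Read the flow at the grid points, then deform the grid with it.
    offsets = F.grid_sample(flow, normalize(pts), align_corners=True)
    offsets = offsets.permute(0, 2, 3, 1)             # (1, S, S, 2)
    deformed = normalize(pts + offsets)

    # Sample context features at the deformed grid and average-pool (Eq. 3).
    ctx = F.grid_sample(feat_ctx, deformed, align_corners=True)  # (1, C, S, S)
    return ctx.mean(dim=(2, 3))                       # (1, C) = C_{t, tau}
```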

Temporal Aggregation. We concatenate \(\mathcal {C}_{t, \tau }\) in all \(T_{\text {ctxt}}\) context frames to form \(\mathcal {C}_{t}\) and enhance proposal features by fusing \(\mathcal {C}_{t}\) into \(\mathcal {Q}_{t}\):

$$\begin{aligned} \mathcal {Q}_{\text {ctxt},t}^{l}=\mathcal {Q}_{\text {ctxt},t}^{l-1} + {\text {Attention}}(\mathcal {Q}_{\text {ctxt},t}^{l-1}, \mathcal {C}_{t}, \mathcal {C}_{t}) \end{aligned}$$
(4)

where \(l=1,\cdots ,L\) represent layer indices, \(\mathcal {Q}_{\text {ctxt},t}^{0}=\mathcal {Q}_{t}\), and \({\text {Attention}}(Q,K,V)\) is Multi-head Attention [18]. We refer to the concatenation of the TC-enhanced proposal features of all \(T+1\) frames as \(\mathcal {Q}_{\text {ctxt}}\). To extract consistent TC, the T previous frames share the same context frames as the current frame.
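
A hedged sketch of Eq. (4), with multi-head cross-attention from the proposal features to the concatenated TC features (the feature dimension and number of heads are illustrative):

```python
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Sketch of Eq. (4): queries are the proposal features Q_t, keys and
    values are the concatenated temporal-context features C_t."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(layers)])

    def forward(self, q, c):  # q: (1, N, dim); c: (1, N * T_ctxt, dim)
        for attn in self.layers:
            out, _ = attn(q, c, c)  # proposals attend to temporal contexts
            q = q + out             # residual update (Eq. 4)
        return q
```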

3.3 UltraDet for Real-Time Lesion Detection

We integrate the NTCA module into the BasicDet introduced in Sect. 3.1 to form the UltraDet model, which is illustrated in Fig. 2. The head of UltraDet consists of stacked NTCA and relation modules:

$$\begin{aligned} \mathcal {Q}^l=\mathcal {Q}^{l}_{\text {ctxt}}+{\text {Relation}}(\mathcal {Q}^{l}_{\text {ctxt}},\mathcal {B}). \end{aligned}$$
(5)

During training, we apply regression and classification losses \(\mathcal {L}=\mathcal {L}_{\text {reg}} + \mathcal {L}_{\text {cls}}\) to the current frame. To improve training efficiency, we apply auxiliary losses \(\mathcal {L}_{\text {aux}}=\mathcal {L}\) to all previous T frames. During inference, the UltraDet model uses the current frame and T previous frames as inputs and generates predictions only for the current frame. This design endows the UltraDet with the ability to perform real-time lesion detection.
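
Combining the pieces, below is a sketch of the stacked head of Eqs. (4)–(5), reusing the TemporalAggregation and RelationLayer sketches above. It is illustrative only; the real head also consumes the boxes \(\mathcal {B}\) for the relation term, which our simplified RelationLayer omits.

```python
import torch.nn as nn

class UltraDetHead(nn.Module):
    """Sketch of the UltraDet head: L stacked layers, each applying the NTCA
    module (Eq. 4) followed by the relation module (Eq. 5)."""
    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        self.ntca = nn.ModuleList(
            [TemporalAggregation(dim, layers=1) for _ in range(num_layers)])
        self.relation = nn.ModuleList(
            [RelationLayer(dim) for _ in range(num_layers)])

    def forward(self, q, c):
        # q: (1, N_total, dim) proposals of all T+1 frames
        # c: (1, M, dim) concatenated temporal-context features
        for ntca, rel in zip(self.ntca, self.relation):
            q = ntca(q, c)  # Q^l_ctxt: TC-enhanced proposal features
            q = rel(q)      # Eq. (5): inter-lesion relation update
        return q
```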

4 Experiments

4.1 Dataset

CVA-BUS Dataset. We use the open-source CVA-BUS dataset proposed in CVA-Net [9], which consists of 186 valid videos. We split the dataset into train-val (154 videos) and test (32 videos) sets. The train-val split contains 21,423 frames with 170 lesions; the test split contains 3,849 frames with 32 lesions. We focus on the lesion detection task and do not utilize the benign/malignant classification labels provided in the original dataset.

High-Quality Labels. The bounding box labels in the original CVA-BUS dataset are unstable and sometimes inaccurate, leading to jittery and imprecise model predictions. We provide a new version of high-quality labels re-annotated by experienced radiologists, and we reproduce all baselines using these labels to ensure a fair comparison. Visual comparisons of the two versions of labels are available in the supplementary materials. To facilitate future research, we will release these high-quality labels.

Table 1. Quantitative results of real-time lesion detection on CVA-BUS [9].

4.2 Evaluation Metrics

Pr80, Pr90. In clinical applications, it is important for detection models to be sensitive. We therefore report frame-level precision at high recall rates of 0.80 and 0.90, denoted as Pr80 and Pr90, respectively.
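
As a sketch, Pr80/Pr90 can be computed from prediction confidences and ground-truth labels as follows (simplified to matched prediction-label pairs rather than full box matching; the helper below is hypothetical, not from the paper):

```python
import numpy as np

def precision_at_recall(scores, labels, target_recall=0.90):
    """Sketch of Pr80/Pr90: precision at the confidence threshold that first
    reaches the target recall. scores: (N,) confidences; labels: (N,) with 1
    for true lesions and 0 for false positives (NumPy arrays)."""
    order = np.argsort(-scores)            # sort by descending confidence
    tp = np.cumsum(labels[order] == 1)
    fp = np.cumsum(labels[order] == 0)
    recall = tp / max(labels.sum(), 1)
    precision = tp / (tp + fp)
    idx = np.searchsorted(recall, target_recall)  # first point reaching target
    return precision[min(idx, len(precision) - 1)]
```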

FP80, FP90. We further report lesion-level FP rates as critical metrics. Frame-level FPs are linked by IoU scores to form FP sequences [24]. The number of FP sequences per minute at recall rates of 0.80 and 0.90 is reported as FP80 and FP90, respectively. The unit of lesion-level FP rates is seq/min.
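
A simplified greedy illustration of this linking rule is sketched below; the exact procedure follows [24], and the per-minute rate is obtained by dividing the sequence count by the video duration in minutes.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def count_fp_sequences(fp_boxes_per_frame, iou_thresh=0.5):
    """Greedy sketch of lesion-level FP counting: an FP starts a new sequence
    unless it overlaps (IoU >= threshold) an FP in the previous frame.
    fp_boxes_per_frame: one list of FP boxes per frame, in temporal order."""
    n_sequences, prev = 0, []
    for boxes in fp_boxes_per_frame:
        for box in boxes:
            if not any(iou(p, box) >= iou_thresh for p in prev):
                n_sequences += 1  # no match in previous frame: new sequence
        prev = boxes
    return n_sequences
```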

AP50. We report AP50 instead of mAP or AP75 because an IoU threshold of 0.50 is sufficient for lesion localization in clinical practice; higher thresholds such as 0.75 or 0.90 are impractical due to blurred lesion edges.

R@16. To evaluate the highest achievable sensitivity, we report the frame-level average recall rates of Top-16 proposals, denoted as R@16.

4.3 Implementation Details

UltraDet Settings. We use FlowNetS [3] as the fixed FlowNet in IOF Align and share the finding of previous works [4, 12, 13] that a FlowNet trained on natural-image datasets generalizes well to ultrasound data. We set the pooling stride of the FlowNet to 4, the number of UltraDet head layers to \(L=2\), the number of previous frames to \(T=15\) with \(T_{\text {ctxt}}=2\), and the number of proposals to 16. We cache intermediate results of previous frames and reuse them to speed up inference. Other hyper-parameters are listed in the supplementary materials.
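
A minimal sketch of such a cache (names and structure are illustrative, not the released code): feature maps and proposals of the most recent frames are kept, so each incoming frame costs only one backbone pass plus the detection head.

```python
from collections import deque

class FrameCache:
    """Illustrative inference-time cache for the T most recent frames."""
    def __init__(self, max_frames=15):
        self.feats = deque(maxlen=max_frames)      # backbone feature maps F_tau
        self.proposals = deque(maxlen=max_frames)  # (B_tau, Q_tau) per frame

    def push(self, feat, boxes, proposal_feats):
        # Old entries are evicted automatically once maxlen is exceeded.
        self.feats.append(feat)
        self.proposals.append((boxes, proposal_feats))
```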

Shared Settings. All models are built with the PyTorch framework and trained on eight NVIDIA GeForce RTX 3090 GPUs. We use ResNet34 [6] as the backbone and set the number of training iterations to 10,000. We set the feature dimension of the detection heads to 256, and baselines are re-implemented to utilize only previous frames. We refer readers to our code for more details.

4.4 Main Results

Quantitative Results. We compare real-time detectors with UltraDet in Table 1. We perform 4-fold cross-validation and report mean values and standard errors on the test set to mitigate fluctuations. UltraDet outperforms all previous state-of-the-art methods in terms of precision and FP rates. In particular, the Pr90 of UltraDet reaches 90.8%, a 5.4% absolute improvement over the best competitor, PTSEFormer [20]. Moreover, the FP90 of UltraDet is 5.7 seq/min, about 50% fewer FPs than PTSEFormer. Although CVA-Net [9] achieves an AP50 comparable to ours, our method significantly improves precision and FP rates over CVA-Net [9].

Fig. 4. (a) Ratios of FPs that are suppressible by leveraging NTC. (b) Visual comparison of BasicDet and UltraDet predictions at recall 0.90. Blue boxes are true positives and red boxes are FPs. (Color figure online)

Importance of NTC. In Fig. 4(a), we show the ratios of FPs that can be suppressed by using NTC, as judged manually by experienced radiologists. About 50%–70% of the FPs of previous methods are suppressible. By utilizing NTC, our UltraDet effectively prevents this type of FP.

Inference Speed. We run inference on one NVIDIA GeForce RTX 3090 GPU and report the inference speed in Table 1. UltraDet achieves an inference speed of 30.4 FPS, meeting the 30 FPS requirement. With TensorRT, we further optimize the speed to 35.2 FPS, which is sufficient for clinical applications [19].

Qualitative Results. Figure 4(b) visually compares BasicDet and UltraDet. BasicDet reports FPs at \(t=30\) and 40 because it fails to leverage the NTC available at \(t=20\), while UltraDet successfully suppresses these FPs with the NTCA module.

4.5 Ablation Study

Table 2. Ablation study of each NTCA sub-module.

Effectiveness of Each Sub-module. We ablate the effectiveness of each sub-module of the NTCA module in Table 2. Specifically, we replace the IOF Align with an RoI Align and the Temporal Aggregation with a simple average pooling in the temporal dimension. The results demonstrate that both IOF Align and Temporal Aggregation are crucial, as removing either of them leads to a noticeable drop in performance.
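
For illustration, the average-pooling ablation replaces the attention of Eq. (4) with a parameter-free temporal mean (a sketch; shapes are illustrative assumptions):

```python
import torch

def temporal_avg_pool(q, c_stacked):
    """Ablation sketch: Temporal Aggregation replaced by a simple average over
    the temporal dimension. q: (1, N, dim) proposal features;
    c_stacked: (1, T_ctxt, N, dim) per-context-frame TC features."""
    return q + c_stacked.mean(dim=1)
```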

Table 3. Design of the NTCA Module.

Design of the NTCA Module. Besides the RoI-level TC aggregation in UltraDet, feature-level aggregation is also feasible. We plug the optical flow feature warping proposed in FGFA [28] into the BasicDet and report the results in Table 3. We find that RoI-level aggregation is more effective than feature-level aggregation, and aggregating at both levels provides no additional gains. This finding agrees with radiologists' practice of focusing on local regions rather than on global information.

5 Conclusion

In this paper, we address the clinical challenge of real-time ultrasound lesion detection. We propose a novel Negative Temporal Context Aggregation (NTCA) module that imitates radiologists' diagnostic process to suppress FPs. The NTCA module leverages negative temporal contexts, which are essential for FP suppression but ignored in previous works. We plug the NTCA module into a BasicDet to form the UltraDet model, which significantly improves precision and FP rates over previous state-of-the-art methods while achieving real-time inference speed. UltraDet has the potential to serve as a real-time lesion detection application and to assist radiologists in making more accurate cancer diagnoses in clinical practice.