1 Introduction

Automatic segmentation of breast lesions in ultrasound (US) video is essential for computer-aided clinical examination and treatment [5]. Compared with image segmentation, segmentation in US video is closer to clinical practice, as the video provides additional temporal information about the target object. The task can be formulated as a binary labeling problem that aims to automatically segment target lesions at the pixel level from a breast US video. It remains rarely explored, largely due to the lack of publicly available annotated US video datasets.

Many convolutional neural networks (CNNs), such as U-Net [9] and its variants [10, 13], have achieved outstanding performance on various US image segmentation benchmarks by learning robust representative features. However, directly applying these image segmentation methods to each US video frame independently fails to capture temporal context and results in temporal inconsistency. Recently, Transformers have become increasingly popular in video object segmentation [3]. To model long-range relationships, Transformers employ a self-attention mechanism that computes pairwise similarities between all input units. As a representative, the space-time memory network (STM) [7] leverages a memory network to read relevant information from a temporal buffer of all preceding frames. STM performs dense matching in the feature space to capture context information with an unlimited receptive field. However, its non-local property may cause mismatching, since lesions in US videos usually appear in local neighborhoods across memory frames. In addition, the memory grows linearly with the video length during inference, which incurs a huge computational cost and may lead to memory overflow.

Fig. 1. Each column contains sample frames from a video in the breast lesion segmentation dataset, with high-quality pixel-level annotations marked in red. (Color figure online)

To tackle the above challenges, we first introduce a new US video dataset with accurate frame-wise pixel-level annotations for breast lesion segmentation; see Fig. 1 for examples. Then, we propose a Dynamic Parallel Spatial-Temporal Transformer (DPSTT) framework for US video segmentation. Specifically, following STM, we first extract pairs of key and value embeddings from the current frame and all frames in the memory with a convolution-based encoder. Subsequently, we split the memory module into two parallel temporally- and spatially-decoupled Transformer blocks. In the temporally-decoupled block, the key maps are spatially divided into multiple non-overlapping patches, and attention is computed only between co-located regions of the current-frame embedding and the memory-frame embeddings. This temporal operation makes modeling the pixel movements of breast lesions easier. By contrast, the spatially-decoupled block computes attention between the embedding of the current frame and that of the previous frame in a non-local manner, which models the global similarity of stationary background texture between two adjacent frames. Moreover, to prevent unlimited memory growth during inference, we develop a non-uniform adaptive memory selection scheme that dynamically updates the frames in the memory based on a similarity metric. In summary, our contributions are threefold: (1) We are the first to present an annotated benchmark dataset specifically designed for breast lesion segmentation in US videos, which would promote progress in medical video analysis. (2) We propose the Dynamic Parallel Spatial-Temporal Transformer (DPSTT) framework for US video segmentation, which improves lesion segmentation performance with higher computational efficiency. (3) We conduct extensive experiments to evaluate the proposed DPSTT. Experimental results demonstrate that our method outperforms state-of-the-art methods by a large margin.

Fig. 2. Overview of the proposed DPSTT framework. Our network consists of two encoders (a memory encoder for the past frames and a query encoder for the current frame), a parallel pair of spatially- and temporally-decoupled Transformers, and a decoder. The memory encoder takes an RGB image and its corresponding lesion mask as input, whilst the query encoder takes only an RGB image. Here we repeat the memory encoder for the previous frame for better visualization.

2 Method

The overall framework of the proposed DPSTT is shown in Fig. 2. Given a US video sequence, we regard the current frame as the query frame and the past frames with annotated object masks as the memory frames. During video segmentation, both the memory frames and the query frame are first encoded into pairs of key and value maps through the memory encoder and the query encoder, respectively. Different from STM, which constructs a global memory read module over the whole video space, we disentangle the non-local attention into two parallel lightweight modules along the spatial and temporal dimensions. The keys and values then pass through the spatially-decoupled and temporally-decoupled Transformers. Specifically, the spatial Transformer takes the keys and values from the query and the previous frame to extract global background context information, while the temporal Transformer takes the keys and values at the same local regions from the query and memory frames to aggregate the temporal movement of target objects. The outputs of the spatial and temporal Transformers are finally sent to the decoder, which estimates the target mask for the query frame.

2.1 Query and Memory Encoder

Both the query and memory encoders share the same structure except for the input. Similar to STM, we use ResNet50 [4] as the backbone network and modify the first convolutional layer of the memory encoder to take a 4-channel input. Two parallel convolutional layers then embed the backbone output into a pair of key and value maps by reducing its channel size to 1/8 and 1/2, respectively. We denote by \(\textbf{k}^Q \in \mathbb {R}^{H \times W \times C/8}\) and \(\textbf{v}^Q \in \mathbb {R}^{H\times W \times C/2}\) the key and value maps of the query frame, where H is the height, W is the width and C is the channel size of the feature map. Similarly, each of the T memory frames (\(T \ge 1\)) is independently embedded into key and value maps, represented as \(\textbf{k}^M \in \mathbb {R}^{T \times H\times W\times C/8}\) and \(\textbf{v}^M \in \mathbb {R}^{T \times H\times W \times C/2}\). For ease of description, we also denote the key and value maps of the previous frame by \(\textbf{k}^P\) and \(\textbf{v}^P\), which have the same resolution as \(\textbf{k}^Q\) and \(\textbf{v}^Q\).
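
For concreteness, the following is a minimal PyTorch sketch of the encoder design described above, assuming the ResNet50 backbone is truncated after its third residual stage (1024 channels); the class names and the exact truncation point are illustrative rather than the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the query encoder with key/value
# projection heads. The memory encoder would additionally replace conv1 with a
# 4-channel input convolution to accept the concatenated frame and mask.
import torch.nn as nn
import torchvision


class KeyValueHead(nn.Module):
    """Projects backbone features into key (C/8) and value (C/2) maps."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.key_proj = nn.Conv2d(in_channels, in_channels // 8, kernel_size=3, padding=1)
        self.value_proj = nn.Conv2d(in_channels, in_channels // 2, kernel_size=3, padding=1)

    def forward(self, feat):
        return self.key_proj(feat), self.value_proj(feat)


class QueryEncoder(nn.Module):
    """ResNet50 backbone (assumed truncated after layer3) plus key/value heads."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3)  # output: 1024 channels
        self.head = KeyValueHead(in_channels=1024)

    def forward(self, frame):            # frame: (B, 3, H0, W0)
        feat = self.backbone(frame)      # (B, 1024, H, W)
        return self.head(feat)           # key: (B, 128, H, W), value: (B, 512, H, W)
```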

2.2 Parallel Spatial Temporal Transformer

Different from the memory read module in STM, which simultaneously performs similarity matching between all pixels of the query frame and the memory frames, we disentangle this expensive module into two much simpler components: a temporally-decoupled Transformer block for extracting local features along the temporal dimension, and a spatially-decoupled Transformer block for capturing global features between the query frame and its previous frame in a non-local manner.

For the temporal Transformer, given the memory key \(\textbf{k}^M\) and the query key \(\textbf{k}^Q\), we split them into \(s^2\) non-overlapping patches along the height and width dimensions. Each region is represented by \(\textbf{k}^{M_k}_{ij} \in \mathbb {R}^{H/s \times W/s \times C/8}, k\in [1,T]\) for the memory and \(\textbf{k}^Q_{ij} \in \mathbb {R}^{H/s \times W/s \times C/8}\) for the query, where \(i,j\in [1,s]\) index the local region. We then group the local memory regions along the temporal dimension, i.e., \(P^M_{ij} = \{\textbf{k}^{M_1}_{ij}, \cdots , \textbf{k}^{M_T}_{ij}\}\). The temporal Transformer then measures the local similarity with:

$$\begin{aligned} f(P^{M_k}_{ij}, \textbf{k}^Q_{ij}) = \text {Softmax}\big (\text {exp}(P^{M_k}_{ij} \otimes \textbf{k}^Q_{ij})\big ), \end{aligned}$$
(1)

where \(\otimes \) denotes the dot product. With the soft weights, the memory values are subsequently retrieved by a weighted summation as follows:

$$\begin{aligned} \textbf{v}_{ij}^T =\sum _{k=1}^T f(P^{M_k}_{ij}, \textbf{k}^Q_{ij})\textbf{v}_{ij}^{M_k}. \end{aligned}$$
(2)

The resulting \(\textbf{v}_{ij}^T\), concatenated with the query value at the same location, is further organized into a new tensor according to its location index to produce the temporal Transformer output \(\textbf{y}^T\). By doing so, continuous movements of the target object within a smaller spatial region can be detected without the disturbance of redundant temporal features.
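
The following PyTorch sketch illustrates one possible implementation of this temporally-decoupled read (Eqs. (1)-(2)): keys and values are split into co-located patches, attention is computed only within each patch across the T memory frames, and the retrieved memory values are concatenated with the query values and folded back into a feature map. The tensor layout and reshaping details are assumptions, not the authors' code.

```python
# Minimal sketch of the temporally-decoupled read. Attention is restricted to
# co-located s x s patches between the query frame and the T memory frames.
import torch
import torch.nn.functional as F


def temporal_read(k_q, v_q, k_m, v_m, s=2):
    """
    k_q: (B, Ck, H, W)     query key        v_q: (B, Cv, H, W)     query value
    k_m: (B, T, Ck, H, W)  memory keys      v_m: (B, T, Cv, H, W)  memory values
    returns y_t: (B, 2*Cv, H, W)
    """
    B, T, Ck, H, W = k_m.shape
    Cv = v_m.shape[2]
    h, w = H // s, W // s

    def split(x, C):
        # (B, N, C, H, W) -> (B, s*s patches, C, N*h*w pixels per patch)
        x = x.view(B, -1, C, s, h, s, w)
        x = x.permute(0, 3, 5, 2, 1, 4, 6).contiguous()   # (B, s, s, C, N, h, w)
        return x.view(B, s * s, C, -1)

    kq, vq = split(k_q.unsqueeze(1), Ck), split(v_q.unsqueeze(1), Cv)
    km, vm = split(k_m, Ck), split(v_m, Cv)

    # per-patch similarity between query and memory pixels (cf. Eq. 1)
    attn = torch.einsum('bpcn,bpcm->bpnm', kq, km)        # (B, s*s, h*w, T*h*w)
    attn = F.softmax(attn, dim=-1)

    # weighted sum of memory values (cf. Eq. 2), concatenated with query values
    read = torch.einsum('bpnm,bpcm->bpcn', attn, vm)      # (B, s*s, Cv, h*w)
    y = torch.cat([vq, read], dim=2)                      # (B, s*s, 2*Cv, h*w)

    # fold patches back to a (B, 2*Cv, H, W) feature map
    y = y.view(B, s, s, 2 * Cv, h, w).permute(0, 3, 1, 4, 2, 5).contiguous()
    return y.view(B, 2 * Cv, H, W)
```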

For the spatial Transformer, we assume that the previous frame exhibits little movement or appearance change relative to the query frame, so the previous frame with its estimated mask provides coarse guidance for the query frame. Therefore, similarity matching between the previous frame and the query frame is performed in a non-local manner with

$$\begin{aligned} f(\textbf{k}^Q, \textbf{k}^P) = \text {Softmax}\big ( \text {exp}(\textbf{k}^Q \otimes \textbf{k}^P)\big ). \end{aligned}$$
(3)

Then the output of the spatial Transformer is generated by

$$\begin{aligned} \textbf{y}^S = [\textbf{v}^Q, f(\textbf{k}^Q, \textbf{k}^P) \textbf{v}^P ]. \end{aligned}$$
(4)

In this way, the spatial Transformer pays more attention to the global static background context information. Finally, these two decoupled spatial and temporal Transformers are calculated in parallel and their outputs \(\textbf{y}^S\) and \(\textbf{y}^T\) are concatenated and further refined by a convolutional operation.
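
A minimal sketch of this spatially-decoupled read (Eqs. (3)-(4)) and of the parallel fusion step is given below; the fusion convolution's kernel size and channel widths are assumptions, and `temporal_read` refers to the sketch above.

```python
# Minimal sketch of the spatially-decoupled read and parallel fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


def spatial_read(k_q, v_q, k_p, v_p):
    """Non-local read between the query frame and its previous frame.
    k_q, k_p: (B, Ck, H, W); v_q, v_p: (B, Cv, H, W)."""
    B, Ck, H, W = k_q.shape
    Cv = v_q.shape[1]
    q = k_q.view(B, Ck, H * W)
    p = k_p.view(B, Ck, H * W)
    attn = F.softmax(torch.bmm(q.transpose(1, 2), p), dim=-1)       # (B, HW, HW), cf. Eq. (3)
    read = torch.bmm(v_p.view(B, Cv, H * W), attn.transpose(1, 2))  # (B, Cv, HW)
    return torch.cat([v_q, read.view(B, Cv, H, W)], dim=1)          # (B, 2*Cv, H, W), cf. Eq. (4)


class ParallelFusion(nn.Module):
    """Concatenates the two branch outputs and refines them with a convolution."""
    def __init__(self, value_channels):
        super().__init__()
        self.fuse = nn.Conv2d(4 * value_channels, 2 * value_channels, kernel_size=3, padding=1)

    def forward(self, y_s, y_t):         # each (B, 2*Cv, H, W)
        return self.fuse(torch.cat([y_s, y_t], dim=1))
```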

2.3 Decoder

The decoder takes the refined output of the decoupled spatial and temporal Transformers to estimate the lesion mask for the query frame. We follow the refinement module in [7] to build the decoder, which gradually upscales the feature map with a set of residual convolutional blocks. Finally, we minimize the binary cross-entropy (BCE) loss and the Dice loss between the predicted object masks \(\hat{Y}\) and the ground truth labels Y.
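
As a reference, a minimal sketch of this objective with the equal 0.5/0.5 weighting reported in Sect. 3.1 is shown below; the Dice smoothing constant is an assumption.

```python
# Minimal sketch of the combined BCE + Dice training objective.
import torch
import torch.nn.functional as F


def segmentation_loss(logits, target, eps=1.0):
    """logits, target: (B, 1, H, W); target is a binary float mask."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)
    return 0.5 * bce + 0.5 * dice.mean()
```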

2.4 Dynamic Memory Selection

Although the spatial and temporal Transformers benefit from storing sufficient information in the memory frames, storing all past frames is impossible and may lead to memory overflow. To eliminate unnecessary features, we propose a simple yet effective selection mechanism to dynamically update the memory frames. We maintain a fixed set of K memory frames for segmenting the \(t^{th}\) query frame if \(t > K\), or use all past frames as memory frames if \(t \le K\). We then update the memory buffer by selecting the K most relevant frames. For example, assume that we have memory frames M for segmenting the \(t^{th}\) frame (\(t>K\)); when moving forward to the \((t+1)^{th}\) frame, we adopt the cosine metric and sort the resulting similarity values with

$$\begin{aligned} \text {Sort}\{Cos(\textbf{k}^{Q_{t+1}}, \textbf{k}^{M_{k}}), Cos(\textbf{k}^{Q_{t+1}}, \textbf{k}^{M_{t}})\}, k\in [1,K]. \end{aligned}$$
(5)

Then the memory frames are updated by adding the \(t^{th}\) frame at the tail and removing the frame with the least similarity value. It is noteworthy that our dynamic memory selection speeds up the temporal attention calculation and performs online adaptation without additional training.
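
A minimal sketch of this update rule is given below, assuming that frame-level similarity is measured by cosine similarity between flattened key maps; the data layout of the memory buffer is illustrative.

```python
# Minimal sketch of the dynamic memory selection (cf. Eq. 5): the candidates
# are the current K memory frames plus the newly segmented t-th frame, and the
# candidate least similar to the incoming query key is dropped.
import torch
import torch.nn.functional as F


def update_memory(memory, new_frame, query_key, K=10):
    """
    memory:    list of dicts {'key': (Ck, H, W), 'value': ..., 'mask': ...}
    new_frame: dict for the just-segmented t-th frame
    query_key: key map of the (t+1)-th query frame, shape (Ck, H, W)
    """
    candidates = memory + [new_frame]
    if len(candidates) <= K:
        return candidates
    q = query_key.flatten()
    sims = [F.cosine_similarity(q, c['key'].flatten(), dim=0) for c in candidates]
    drop = int(torch.stack(sims).argmin())   # least relevant candidate
    return [c for i, c in enumerate(candidates) if i != drop]
```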

Time Complexity. With such a pipeline, we significantly reduce the computational complexity of the memory read module in STM from \(\mathcal {O}(TH^2W^2C)\) to \(\mathcal {O}(KH^2W^2C/s^4)\) for the temporally-decoupled block and \(\mathcal {O}(H^2W^2C)\) for the spatially-decoupled block. Although \(T=3\) is used during training, T increases linearly with the video length during inference and becomes much larger than the predetermined K. In addition, the temporal Transformer becomes more efficient as s grows.

Table 1. Quantitative comparison with different methods on the proposed dataset.

3 Experiments

3.1 Dataset and Implementation

Here we describe the newly collected dataset, specifically designed for the task of breast lesion segmentation in US videos. Sample frames of the breast US videos are shown in Fig. 1. The breast US dataset comprises 63 video sequences (one video sequence per person) with 4619 frames annotated with pixel-level ground truth by experts. These videos are collected from different US devices, and their spatial resolution varies from \(580 \times 600\) to \(600 \times 800\). To ease training, we further crop the video sequences to a spatial resolution of \(300\times 200\). For quantitative comparison, we employ several widely used segmentation evaluation metrics, namely the Jaccard similarity coefficient (Jaccard), Dice similarity coefficient (Dice), Precision, and Recall; see [11] for their definitions. Moreover, we adopt five-fold cross-validation on our dataset to statistically compare different video segmentation methods.

We implement our network using the PyTorch framework with an NVIDIA RTX 3090 graphics card. In our experiments, all input US frames are empirically resized to 240\(\times \)240, and the number of training epochs is set to 100. During training, we sample T (\(T=3\)) temporally ordered frames with a random skip of at most N frames (\(N \le 5\)) from a US video. We set the batch size to 4 and the learning rate to \(1e-4\). We use the binary cross-entropy (BCE) loss and the Dice loss with weights of 0.5 and 0.5 during training. For the temporal Transformer, we set s to 2. During inference, when the number of memory frames exceeds K (\(K=10\)), the dynamic selection mechanism is activated to eliminate the most redundant frame from the memory.
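
For clarity, a minimal sketch of this training-time frame sampling is given below. Interpreting "a random skip of at most N frames" as a gap of up to N+1 indices between sampled frames is an assumption, as is the function name.

```python
# Minimal sketch of training-time frame sampling: T = 3 temporally ordered
# frames with a random gap of 1..(max_skip + 1) indices between neighbours.
import random


def sample_training_frames(num_frames, T=3, max_skip=5):
    """Return T ordered frame indices drawn from a video of num_frames frames."""
    gaps = [random.randint(0, max_skip) + 1 for _ in range(T - 1)]
    span = sum(gaps)
    start = random.randint(0, max(0, num_frames - 1 - span))
    indices = [start]
    for g in gaps:
        indices.append(indices[-1] + g)
    return indices

# e.g. sample_training_frames(100) might return [37, 40, 46]
```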

Fig. 3. Visual comparison with competitive video-based methods on two breast lesion cases.

3.2 Comparison with State-of-the-Art Methods

Quantitative Comparisons. As shown in Table 1, we quantitatively compare our method with state-of-the-art methods, including image-based segmentation methods (UNet [9], UNet++ [13], TransUNet [2] and SERT [12]) and video-based segmentation methods (OSVOS [8], ViViT [1], STM [7], AFB-URR [6]). From the results, we observe that the video-based methods tend to outperform the image-based methods with higher evaluation scores, which demonstrates that leveraging temporal information provides promising benefits for breast lesion segmentation in US videos. More importantly, among all video-based segmentation methods, our DPSTT achieves the highest Jaccard score of 73.64 and Dice score of 82.55. This indicates that our method, which combines a CNN-based encoder with spatial-temporal Transformers, is able to learn both high- and low-level cues simultaneously and thus achieves significant improvements over pure Transformer methods such as SERT and ViViT. Table 1 also reports the inference speed of the different methods. Owing to the parallel operation of the decoupled Transformers, our method avoids much redundant computation and thus runs the fastest among the video-based approaches.

Qualitative Comparisons. Figure 3 visualizes the qualitative comparison of lesion masks among different video segmentation methods. We observe that our method provides more precise masks with more consistent boundaries than STM and AFB-URR. This is because our dynamic selection mechanism provides the most relevant memory and preserves spatial consistency.

Table 2. Ablation study on different transformer combination strategies. T denotes the temporal block and S is the spatial block. \(\times N\) denotes repeating N times.

3.3 Ablation Study

The Effect of Transformers. We evaluate the effect of the different Transformers by removing the spatial and temporal blocks separately in Table 2. It shows that combining both modules consistently yields better performance, because neither decoupled Transformer alone can capture both stationary texture and motion information. We further compare different stacking strategies in rows 3–5. Stacking the two Transformers in parallel performs better than interweaving them, regardless of whether the stack starts from the spatial or the temporal block. Moreover, a single pair of parallel temporal and spatial blocks is sufficient to capture representative features.

Table 3. Ablation study on different memory selection strategies.

The Effect of Dynamic Memory. We investigate different memory selection strategies in Table 3 by comparing the segmentation performance of the skip memory (every five frames) used in STM, a random memory of fixed size, and our dynamic memory. The results show that the random memory performs worse than the other two strategies. This phenomenon verifies our assumption that a well-chosen memory benefits segmentation performance.

4 Conclusion

In this paper, we present the first pixel-wise annotated benchmark dataset for breast lesion segmentation in US videos. Then a Dynamic Parallel Spatial-Temporal Transformer framework is proposed for US video segmentation. Moreover, an efficient dynamic memory selection is further developed based on the similarity metric to prevent memory overflow. Finally, we conduct extensive experiments to evaluate the efficacy of our method.