1 Introduction

Video Object Segmentation (VOS), a fundamental task in the computer vision community, has attracted increasing attention in recent years due to its potential applications in autonomous driving, object tracking [3, 12, 13], activity recognition [55], video editing, etc. In this paper, we focus on semi-supervised VOS, in which the target objects' masks are provided in the first frame and the algorithm should produce segmentation masks for those objects in the subsequent frames. Under this setting, VOS remains challenging due to object occlusion, deformation, appearance variation, and similar-object confusion in video sequences.

Fig. 1

(a) Relationships between pixels in the query frame and pixels in reference frames. Temporal relationships (b) are relationships among pixels in different frames, representing the correspondence of target objects across reference and query frames. Spatial relationships (c) are relationships among pixels within a specific reference or query frame, including object appearance information for target localization and segmentation

When processing videos in sequential order, a natural idea is to use more reference/historical frames, which contain abundant temporal information. Recently, state-of-the-art VOS performance has been achieved by matching-based algorithms [5, 11, 22, 27, 32, 39, 42, 43, 45, 48], yet most of them still exhibit two problems: a complicated and redundant pipeline, and inadequate modeling of spatial-temporal relationships when referring to historical frames for segmentation.

First, many existing methods [5, 15, 19, 22, 27, 32, 42, 48] use a complicated and redundant two-extractor pipeline, in which a query extractor encodes features of the current frame and a memory/reference extractor encodes historical information from reference frames. This two-extractor pipeline is flexible for encoding reference sets of different sizes; however, it contains many redundant parameters and increases the model's complexity. A siamese architecture effectively reduces the number of parameters and simplifies the pipeline, but existing approaches [14, 16, 23, 31, 47] are of limited use and fail to retain the flexibility and effectiveness. For example, the abundant edge and contour features in the segmentation masks are not fully leveraged when the predicted masks are directly concatenated with high-level semantic features. In addition, concatenating the previous frame's mask with the query frame may introduce significant displacement errors, while using optical flow to warp the mask is time-consuming.

Second, these methods mostly neglect the spatial relationships (inside each frame) and do not fully model the temporal relationships (among different frames). However, the spatial and temporal relationships are crucial for learning a robust target appearance across frames and handling practical scenarios such as object occlusion, deformation, and appearance variation. To better depict this point, we define two relationships in this paper, i.e., temporal relationships (Fig. 1(b)) and spatial relationships (Fig. 1(c)). The former are the relationships among pixels in different frames, representing the correspondence of target objects between past and current frames, which is vital for learning robust global target object features and helps handle appearance changes across frames. The latter represent the relationships among pixels in a specific frame, including object appearance information for target localization and segmentation, which helps obtain accurate mask boundaries and is essential for learning local target object structure, as explored in [25]. A group of matching-based methods [5, 11, 15, 19, 22, 27, 32, 37, 39, 42, 48, 53] provide partial solutions for capturing the above correspondences and achieve competitive performance. Among them, the Space-Time Memory (STM) based approaches [5, 15, 19, 22, 27, 32, 42, 48] have achieved great success. The basic idea of these methods is to compute the similarities of target objects between the current and past frames by feature matching. However, as illustrated in Fig. 1(a), most of these methods only compute attention between pixels in the query frame and pixels in each reference frame, ignoring the temporal dependency among historical frames and the spatial correlations of pixels inside a specific frame. A few methods pay attention to these issues. For instance, EGMN [27] proposes a fully-connected graph to capture cross-frame correlation, effectively exploiting the temporal relationships. However, EGMN still omits spatial relationships.

To address the above two problems, in this paper, we propose a new framework for VOS, which is a compact and unified single-extractor pipeline and has strong spatial and temporal interaction ability.

Specifically, to represent the reference sets and query frames in a unified way, we develop a plain yet effective feature extractor that has a dynamic input adapter and accepts the reference sets and the query frames at the same time, significantly simplifying the existing VOS framework while keeping its effectiveness and flexibility. It is based on the assumption that a convolution network can be generic to different input visual patterns. Therefore, we design the dynamic input adapter to encode the reference sets and the query frames into different visual patterns. A convolution network is then used to map these visual patterns into feature embeddings. In practice, the dynamic input adapter uses different layers to encode different inputs. For reference sets, the RGB image, the mask's foreground, and the mask's background are encoded and fused to enhance the target appearance, while only the RGB image is encoded for the query frames. In this way, the dynamic feature extractor can encode the inputs in a unified way and keep the flexibility and effectiveness of separate extractors but with a compact architecture.

Moreover, since the dependencies both among different frames and inside every frame are crucial for this task, we introduce the vision transformer to jointly capture the spatial and temporal relationships, generating discriminative spatial-temporal features for segmentation. Our model takes the features of the reference sets and the query frame as the input sequence and exploits the transformer to establish spatial-temporal dependencies simultaneously. We also design a Target Attention Block (TAB) to extract the target's mask features from the query frame, helping obtain the target mask prediction from the outputs of the transformer. Above all, by the cascaded design of the proposed dynamic feature extractor, transformer-based relationship module, and target-enhanced segmentation, our model implements a unified and compact pipeline for VOS.

Finally, we explore potential solutions, such as sequence organizers, to improve the model's efficiency. Since the computational complexity of the self-attention mechanism is proportional to the square of the input sequence length, and not all pixels in the reference sets are important for segmenting the target in the query frame, we design sequence organizers to compress the redundant reference representation. In this way, compared to the vanilla model, we achieve \(\sim \)50% faster inference speed with only a slight 0.2% (\( J \& F\)) drop in segmentation quality on DAVIS17 validation, as shown in Section 4.5.

Our main contributions can be summarized as follows:

  • We propose a compact single-extractor framework to simplify the existing VOS pipeline. Specifically, we design a dynamic feature extractor to represent the two kinds of inputs, i.e., reference sets (history frames with predicted masks) and query frames (current frames), in a unified way, containing fewer parameters while keeping the effectiveness and flexibility of two separate extractors.

  • Considering that the dependencies among different frames and inside every frame are crucial for this task, we attach the vision transformer to the dynamic feature extractor, generating discriminative spatial-temporal features for segmentation. With sufficient spatial-temporal interaction, our model is robust to appearance variation, occlusion, and confusion.

  • We comprehensively evaluate the proposed model on three benchmark datasets, including DAVIS 2016/2017 [34, 35] and YouTube-VOS [49]. The results demonstrate the effectiveness and efficiency of our method in comparison with previous approaches. We also conduct extended experiments to explore how to improve our model's efficiency.

2 Related works

2.1 Tracking-based methods

These methods [3, 13, 41, 46] integrate object tracking techniques to indicate the target location and spatial area for segmentation. SiamMask [46] adds a mask branch on SiamRPN [17] to narrow the gap between tracking and segmentation. FTAN-DTM [13] takes object segmentation as a sub-task of tracking, introducing the “tracking-by-detection” model into VOS. SAT [3] fuses object tracking and segmentation into a truly unified pipeline. It combines SiamFC++ [50] with a proposed estimation-feedback mechanism that switches between the mask box and the tracking box, making the segmentation and tracking tasks enhance each other. The integration of the tracker helps improve the inference speed, while the accuracy of the tracking component often limits these methods' performance.

2.2 Matching-based methods

Recently, state-of-the-art performance has been achieved by matching-based methods [2, 11, 22, 27, 30, 32, 37, 39, 42, 43, 45, 48, 56], which perform feature matching to learn target object appearances offline. VideoMatch [11] measures similarity by soft matching with foreground and background features. FEELVOS [39] and CFBI [53] perform nearest-neighbor matching between the current frame and the first and previous frames in the feature space. STM [32] introduces an external memory to store past frames' features and uses an attention-based matching method to retrieve information from memory. KMN [37] applies Query-to-Memory matching with a kernelized memory read to reduce the non-locality of STM. RMNet [48] proposes to replace STM's global-to-global matching with local-to-local matching to alleviate the ambiguity of similar objects. EGMN [27] organizes the memory network as a fully connected graph that stores frames as nodes and captures cross-frame relationships by edges. SwiftNet [42] designs a Pixel-Adaptive Memory to compress spatiotemporal redundancy. However, these methods do not fully utilize the spatial-temporal relationships among reference sets and query frames. In this paper, we introduce a vision transformer to model spatial-temporal dependencies, which helps handle large object appearance changes.

2.3 Transformer-based methods

Recently, transformers have achieved great success in vision tasks such as image classification [7], object detection [2], semantic segmentation [44], and object tracking [51, 52]. Due to the importance of spatial and temporal relationships for segmentation, we also employ the vision transformer in the VOS task, inspired by DETR [2]. Different from DETR and MaskFormer [4], which only model spatial relationships within a specific frame with the transformer, we fully exert the long-range dependency modeling power of the transformer to simultaneously exploit spatial and temporal relationships among pixels of past frames and the current frame, which is vital and beneficial to the VOS task. In addition, the proposed dynamic feature extractor and the transformer complement each other and form a unified architecture. The former adaptively encodes two types of inputs, i.e., query frames and reference sets, while the latter effectively models two types of relationships among the input sequences.

There are also several transformer-based methods: SST [8], JOINT [29], and AOT [54]. SST uses the transformer's encoder with sparse attention to capture the spatial-temporal information among the current and preceding frames. However, mask representations are not explored in SST. JOINT combines inductive and transductive learning and extends the transduction branch to a transformer architecture. Nevertheless, its network structure is complicated. AOT proposes an Identification Embedding that encodes all masks simultaneously and a Long Short-Term Transformer that captures spatial-temporal dependencies. AOT achieves good performance with fast inference speed, but its short-term attention implies a temporal smoothness assumption, which may not be robust to fast-moving and small objects. Besides, SST, JOINT, and AOT do not employ the transformer's decoder and thus cannot exploit its substantial power.

Fig. 2

The overall architecture of the proposed dynamic feature extractor. The feature extractor is used to extract the features of the current frame and reference sets in a unified way. "\(+\)" indicates the adding operation. For a better view, we only illustrate two reference frames

2.4 Feature extractors

Reference sets (history frames with predicted masks) and query frames (current frames) are essential inputs for semi-supervised VOS. The former carry historical information, while the latter contain the appearance of the current target. In early works, only one extractor was used to encode the inputs. MaskTrack [33] concatenates the query frame with the previous mask prediction as the input of a single ConvNet. After that, siamese architectures that take reference frames as cues became popular. RGMP [31] and AGSS-VOS [23] concatenate the current frame with the previous frame's mask or warped mask to form a 4-channel input, and likewise for the reference sets; a shared extractor with a 4-channel input layer is then used to extract features. For more temporal information, STM-based methods [5, 22, 27, 32, 42, 48] use two extractors, i.e., a 4-channel memory/reference extractor and a query extractor, to extract features from the reference sets and the query frame, respectively. The two-extractor pipeline is flexible for encoding reference sets of different sizes but is bloated and contains many redundant parameters. We argue that a more compact and flexible pipeline can be implemented with the proposed dynamic feature extractor.

3 Methods

The overview of our framework is illustrated in Fig. 3. It mainly consists of a dynamic feature extractor, a vision transformer, a target attention block, and a segmentation head. When segmenting a specific frame, we first use the dynamic feature extractor to extract the features of the current frame and reference sets. The outputs of the extractor are fed into a bottleneck layer to reduce the channel number. The features are then flattened before being fed into a vision transformer, which simultaneously models the temporal and spatial relationships. Moreover, the target attention block takes both the transformer encoder's and decoder's outputs as input and outputs feature maps representing the target mask features. Finally, a segmentation head is attached after the target attention block to obtain the predicted object mask.

3.1 Dynamic feature extractor

As discussed in Section 1, we need a unified feature extractor that can effectively extract the features of the reference sets and the query frame and map them into an embedding space, ready to be fed into the following vision transformer. To utilize more temporal cues, the existing pipelines mostly use two separate extractors to process these two inputs, which are flexible and effective for encoding reference sets of different sizes but contain many redundant parameters and increase the model's complexity. Some methods [14, 31] apply a siamese architecture to encode the two types of inputs, which is more lightweight and compact but shows obvious problems such as insufficient use of mask features and significant shifting errors caused by concatenating the previous mask with the current frame. To overcome the above issues, we combine both advantages, i.e., maintaining the effectiveness and flexibility of two-extractor pipelines while being more lightweight and compact like siamese architectures, and design a dynamic feature extractor to represent the two types of inputs in a unified way. As demonstrated by the experiments in Table 4, compared to the two-extractor pipeline, our model equipped with the proposed dynamic feature extractor has much fewer parameters (about 20% reduction, “Dynamic" vs. “Independent") while maintaining effectiveness and flexibility.

Dynamic input adapter Specifically, we design a dynamic input adapter to adaptively encode two types of inputs, i.e., query frames (RGB frames) and reference sets (pairs of RGB frames with corresponding object masks). As shown in Fig. 2, RGB frames go through the first path, which has one regular convolution. Reference sets go through the second path, which contains three convolutions to encode the RGB frame, the object's foreground mask, and the background mask; the output features of the three convolutions are added together to represent the reference sets. Our method can use an arbitrary convolution network as the feature extractor by replacing its first layer with the dynamic input adapter. Here we employ the first four stages of ResNet [10] as the feature extractor. After going through the dynamic input adapter, the features from the query frame and the reference sets are first concatenated along the temporal dimension and then fed into the convolution network (CNN). Finally, the reference sets and the current frame are mapped to feature maps \({\textbf {f}}\in \mathbb {R}^{(T+1) \times C \times H \times W}\), where H, W, and C are the height, width, and number of channels, and T is the number of reference pairs.
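To make the adapter concrete, the following PyTorch-style sketch illustrates the two encoding paths described above; the 7\(\times \)7 stride-2 stem, the class name, and the channel width are our own assumptions for illustration, not the exact configuration used in the paper.

```python
import torch.nn as nn

class DynamicInputAdapter(nn.Module):
    """Sketch of the dynamic input adapter: one path for query frames (RGB only),
    one path for reference pairs (RGB + foreground mask + background mask)."""
    def __init__(self, out_channels=64):
        super().__init__()
        # Path 1: query frames (RGB only).
        self.query_conv = nn.Conv2d(3, out_channels, 7, stride=2, padding=3)
        # Path 2: reference sets (RGB frame, foreground mask, background mask).
        self.ref_rgb_conv = nn.Conv2d(3, out_channels, 7, stride=2, padding=3)
        self.fg_conv = nn.Conv2d(1, out_channels, 7, stride=2, padding=3)
        self.bg_conv = nn.Conv2d(1, out_channels, 7, stride=2, padding=3)

    def forward(self, rgb, mask=None):
        if mask is None:                      # query frame: RGB only
            return self.query_conv(rgb)
        fg, bg = mask, 1.0 - mask             # foreground / background masks
        # Reference set: sum the three encodings to enhance the target appearance.
        return self.ref_rgb_conv(rgb) + self.fg_conv(fg) + self.bg_conv(bg)
```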

Before feeding into the vision transformer, we use a 1\(\times \)1 convolution layer to reduce the channel dimension of the feature maps from C to d \((d<C)\), resulting in new feature maps \({\textbf {f}}^{'}\in \mathbb {R}^{(T+1) \times d \times H \times W}\). Then, the spatial and temporal dimensions of \({\textbf {f}}^{'}\) are flattened into one dimension, producing feature vectors \({\textbf {X}}\in \mathbb {R} ^{(T+1)HW \times d}\), which serve as the input of the transformer encoder.
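The bottleneck and flattening steps can be sketched as follows; the tensor sizes are illustrative examples and not the paper's exact feature resolution.

```python
import torch
import torch.nn as nn

T, C, H, W, d = 2, 1024, 30, 54, 256          # example sizes (assumed)
f = torch.randn(T + 1, C, H, W)               # backbone features for refs + query

bottleneck = nn.Conv2d(C, d, kernel_size=1)   # 1x1 conv: C -> d channels
f_prime = bottleneck(f)                       # (T+1, d, H, W)

# Flatten space and time into one token dimension: ((T+1)*H*W, d).
X = f_prime.flatten(2).permute(0, 2, 1).reshape(-1, d)
```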

3.2 Relationship modeling

We introduce the vision transformer to model the relationships of the two types of inputs, i.e., reference sets (history frames with predicted masks) and query frames (current frames), making the whole pipeline simple and modularized. Transformers have strong capabilities for modeling spatial-temporal relationships. First, the positional encoding explicitly introduces space-time position information, which could help the encoder model spatial-temporal relationships among pixels in the input frames. Second, the encoder could learn the target object’s correspondence among the input frames and model the target object’s structure in a specific frame. Third, the decoder could predict the spatial positions of the target objects in the query frame and focus on the most relevant object, which learns robust target representations for the target object and empowers our network to handle similar object confusion better.

Positional encoding The self-attention module, the transformer's core component, is permutation invariant. However, both spatial and temporal positional information is vital for establishing spatial and temporal relationships and for accurate object segmentation. Equipped with the space-time location information in the feature maps, the encoder can better capture the spatial and temporal dependencies among all elements of the input sequences, helping our network handle challenging situations such as object occlusion and deformation. Therefore, explicitly embedding space-time position information into the transformer model is essential. We add a sinusoidal positional encoding PE [38] to the embedded features \({\textbf {X}}\) to form the inputs \({\textbf {Z}}\) of the transformer. Mathematically,

$$\begin{aligned} {\textbf {Z}} = {\textbf {X}} + PE \end{aligned}$$
(1)
$$\begin{aligned} PE(pos, 2i) = sin(pos/10000^{2i/d}) \end{aligned}$$
(2)
$$\begin{aligned} PE(pos, 2i+1) = cos(pos/10000^{2i/d}) \end{aligned}$$
(3)

where pos and i are the spatial-temporal position and the dimension index of the features \({\textbf {X}}\), respectively.
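A direct implementation of (1)-(3) could look like the sketch below, where positions index the flattened spatial-temporal sequence and d is assumed to be even.

```python
import torch

def sinusoidal_pe(num_positions, d):
    """Sinusoidal positional encoding following (2)-(3)."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (N, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)                   # 2i values
    div = torch.pow(10000.0, two_i / d)
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(pos / div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos / div)   # odd dimensions
    return pe

# Z = X + sinusoidal_pe(X.shape[0], X.shape[1])   # eq. (1)
```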

Transformer encoder The transformer encoder is used to model the spatial-temporal relationships among elements of the input sequences. It takes features \(\textbf{Z}\) as input and outputs encoded features \({\textbf {E}}\). The encoder consists of L encoder layers, each with a standard architecture, including a multi-head self-attention module and a fully connected feed-forward network. The multi-head self-attention module captures spatial-temporal relationships from different representation sub-spaces.

At each encoder layer l, let \(\textbf{z}^{l-1}_{p, t} \in \mathbb {R}^{d}\) represent an element of the representation \(\textbf{Z}^{l-1}(\textbf{Z}^{0}=\textbf{Z}, \textbf{Z}^{L}=\textbf{E})\) encoded by the preceding encoder layer, where p and t denote the spatial and temporal position, respectively. For the m-th attention head, the query/key/value vectors (\(\textbf{q}^{l, m}_{p, t} / \textbf{k}^{l, m}_{p, t} / \textbf{v}^{l, m}_{p, t}\)) for the element \(\textbf{z}^{l-1}_{p, t}\) are computed by:

$$\begin{aligned} \begin{aligned} \textbf{q}^{l, m}_{p, t} = \textbf{W}^{l, m}_{q}\textbf{z}^{l-1}_{p, t}\\ \textbf{k}^{l, m}_{p, t} = \textbf{W}^{l, m}_{k}\textbf{z}^{l-1}_{p, t}\\ \textbf{v}^{l, m}_{p, t} = \textbf{W}^{l, m}_{v}\textbf{z}^{l-1}_{p, t} \end{aligned} \end{aligned}$$
(4)

The self-attention weights are computed by:

$$\begin{aligned} \varvec{\alpha }^{l, m}_{p, t} = \sigma (\frac{\textbf{q}^{l, m}_{p, t}}{\sqrt{d_m}} \cdot \left[ \{\textbf{k}^{l, m}_{p', t'}\}_{\begin{array}{c} p'=1,\cdots ,HW \\ t'=1,\cdots ,T+1 \end{array}}\right] ) \end{aligned}$$
(5)

Then, the multi-head self-attention feature is calculated by

$$\begin{aligned} \textbf{s}^{l}_{p, t} = \sum _{m=1}^{M}\textbf{W}_{o}^{l, m}\left[ \sum _{t'=1}^{T+1}\sum _{p'=1}^{HW} (\varvec{\alpha }^{l, m}_{p, t})_{p',t'} \cdot \textbf{v}^{l, m}_{p', t'}\right] \end{aligned}$$
(6)

where T represents the number of reference frames, \(\textbf{W}_{q}^{l, m}, \textbf{W}_{k}^{l, m}, \textbf{W}_{v}^{l, m}\in \mathbb {R}^{d_m\times d}\) and \(\textbf{W}_{o}^{l, m} \in \mathbb {R}^{d\times d_{m}}\) are learnable weights (\(d_{m}=d/M\) by default), and \(\sigma \) indicates the softmax function. Note that we compute attention along the joint spatial-temporal dimension, so spatial and temporal relationships are modeled at the same time.

After the multi-head self-attention module, residual connections and layer normalization (\(\textrm{LN}\)) are applied. The features are then passed through an \(\textrm{FFN}\), again with residual connections and layer normalization, to obtain the output features \(\textbf{Z}^{l}\) of encoder layer l.
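For illustration, one encoder layer can be sketched with standard PyTorch modules as below; the post-norm layout and ReLU activation are assumptions consistent with the original transformer, and attention runs over the full (T+1)HW sequence so that spatial and temporal relationships are modeled jointly.

```python
import torch.nn as nn

class SpatialTemporalEncoderLayer(nn.Module):
    """Sketch of one encoder layer implementing (4)-(6) plus FFN/LN."""
    def __init__(self, d=256, num_heads=8, ffn_dim=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, num_heads, dropout=dropout)
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(inplace=True),
                                 nn.Dropout(dropout), nn.Linear(ffn_dim, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, z):                 # z: ((T+1)*H*W, batch, d), sequence-first
        s, _ = self.self_attn(z, z, z)    # joint spatial-temporal self-attention
        z = self.norm1(z + s)             # residual connection + LayerNorm
        z = self.norm2(z + self.ffn(z))   # FFN + residual + LayerNorm
        return z
```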

Transformer decoder The transformer decoder aims to focus on the most relevant object in the query frame and help predict the spatial positions of the target. It takes the encoded features \(\textbf{E}\) and a target query \(\textbf{x}_{q}\) as input and outputs decoded features \(\textbf{x}_{o}\). We only utilize one target query in the decoder to query the features of the specific target object. The decoder also consists of L decoder layers, each including a multi-head self-attention module, a multi-head cross-attention module, and a fully connected feed-forward network. The multi-head self-attention module in our model integrates the target information from different representation sub-spaces, while the multi-head cross-attention module is mainly leveraged to retrieve target object features from the encoder.

At each decoder layer \(l'\), let \(\textbf{x}^{l'-1}(\textbf{x}^{0}=\textbf{x}_q, \textbf{x}^{L}=\textbf{x}_o) \in \mathbb {R}^{d}\) represent the representation extracted by the preceding decoder layer. The multi-head self-attention module is similar to that in the transformer encoder layer. Since there is only one target query, it computes the attention weights against itself. Therefore, the computation of the self-attention feature \(\textbf{x}^{l'}_{s}\) can be simplified as:

$$\begin{aligned} \textbf{x}^{l'}_{s} = \sum _{m'=1}^{M}\textbf{W}_{so}^{m'}(\textbf{W}_{sv}^{m'}\textbf{x}^{l'-1}) \end{aligned}$$
(7)

where \(m'\) indexes the attention head in multi-head self-attention module, \(\textbf{W}_{sv}^{m'}\in \mathbb {R}^{d_{m'}\times d}\) and \(\textbf{W}_{so}^{m'} \in \mathbb {R}^{d\times d_{m'}}\) are learnable weights (\(d_{m'}=d/M\) by default).

Then features \(\hat{\textbf{x}}^{l'}_s\) are passed through a multi-head cross-attention module after the residual connections and the layer normalization (\(\textrm{LN}\)):

$$\begin{aligned} \hat{\textbf{x}}^{l'}_s = \textrm{LN}(\textbf{x}^{l'}_{s}+\textbf{x}^{l'-1}) \end{aligned}$$
(8)

Let \(\textbf{e}_{p, t} \in \mathbb {R}^{d}\) represent an element of \(\textbf{E}\), where p and t denote the spatial and temporal position, respectively. For the \(m'\)-th (\(m' \le M\)) attention head in the multi-head cross-attention module, the key and value vectors \(\textbf{k}^{l', m'}_{p, t}, \textbf{v}^{l', m'}_{p, t}\) are computed as:

$$\begin{aligned} \begin{aligned} \textbf{k}^{l', m'}_{p, t} = \textbf{W}^{l', m'}_{k}\textbf{e}_{p, t} \\ \textbf{v}^{l', m'}_{p, t} = \textbf{W}^{l', m'}_{v}\textbf{e}_{p, t} \end{aligned} \end{aligned}$$
(9)

The cross-attention weights are computed by:

$$\begin{aligned} \varvec{\alpha }^{l', m'}_{s} = \sigma (\frac{\textbf{W}_{q}^{l', m'}\hat{\textbf{x}}^{l'}_s}{\sqrt{d_{m'}}} \cdot \left[ \{\textbf{k}^{l', m'}_{p', t'}\}_{\begin{array}{c} p'=1,\cdots ,HW \\ t'=1,\cdots ,T+1 \end{array}}\right] ) \end{aligned}$$
(10)

Then the cross-attention feature \(\textbf{x}^{l'}_{c}\) is calculated by:

$$\begin{aligned} \textbf{x}^{l'}_{c} = \sum _{m'=1}^{M}\textbf{W}_{o}^{l', m'}[\sum _{t=1}^{T+1}\sum _{p=1}^{HW}(\varvec{\alpha }^{l', m'}_{s})_{p, t} \cdot \textbf{v}^{l', m'}_{p, t}] \end{aligned}$$
(11)

where T denotes the size of the reference set, \(\textbf{W}_{q}^{l', m'}, \textbf{W}_{k}^{l', m'},\) \(\textbf{W}_{v}^{l', m'}\in \mathbb {R}^{d_{m'}\times d}\) and \(\textbf{W}_{o}^{l', m'} \in \mathbb {R}^{d\times d_{m'}}\) are learnable weights (\(d_{m'}=d/M\) by default), and \(\sigma \) indicates the softmax function.

Similar to the encoder layer, residual connections and layer normalization (\(\textrm{LN}\)) are applied after the multi-head cross-attention module. The features are then passed through an \(\textrm{FFN}\), again with residual connections and layer normalization, to obtain the output features \(\textbf{x}^{l'}\) of decoder layer \(l'\).
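A corresponding sketch of one decoder layer with the single target query is given below; as above, the exact normalization placement and activation are our own assumptions.

```python
import torch.nn as nn

class TargetQueryDecoderLayer(nn.Module):
    """Sketch of one decoder layer: self-attention over the (single) target
    query, then cross-attention against the encoded features E, cf. (7)-(11)."""
    def __init__(self, d=256, num_heads=8, ffn_dim=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, num_heads, dropout=dropout)
        self.cross_attn = nn.MultiheadAttention(d, num_heads, dropout=dropout)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d, ffn_dim), nn.ReLU(inplace=True),
                                 nn.Dropout(dropout), nn.Linear(ffn_dim, d))

    def forward(self, x, e):              # x: (1, batch, d); e: ((T+1)*H*W, batch, d)
        s, _ = self.self_attn(x, x, x)    # degenerates to (7) for a single query
        x = self.norm1(x + s)
        c, _ = self.cross_attn(x, e, e)   # retrieve target features from the encoder
        x = self.norm2(x + c)
        x = self.norm3(x + self.ffn(x))
        return x
```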

3.3 Segmentation

Target attention block To obtain the target mask prediction from the outputs of the transformer, the model needs to extract the target's mask features from the query frame. We design a Target Attention Block (TAB) to achieve this goal. TAB computes the attention between the query frame's features \(\textbf{E}_Q\) (taken from \(\textbf{E}\)) and the output features \(\textbf{x}_o\) of the decoder. \(\textbf{x}_o\) and \(\textbf{E}_Q\) are fed into a multi-head attention module (with M heads) to obtain attention maps, which boost the foreground features and suppress background disturbance. We concatenate the attention maps with \(\textbf{E}_Q\) as the input \(\textbf{S}\) of the following segmentation head to enhance the target features. The above procedure can be formulated as follows:

$$\begin{aligned} Attn_{i}(\textbf{x}_o, \textbf{E}_Q) = \sigma (\frac{(\textbf{W}_{q}^{i}\textbf{x}_o)^{T}(\textbf{W}_{k}^{i}\textbf{E}_Q)}{\sqrt{d_{i}}}) \end{aligned}$$
(12)
$$\begin{aligned} \textbf{S} = [\textbf{E}_Q, Attn_{1}(\textbf{x}_o, \textbf{E}_Q), \cdots , Attn_{M}(\textbf{x}_o, \textbf{E}_Q)] \end{aligned}$$
(13)

where i indexes the attention head and \(\textbf{W}_{q}^{i}, \textbf{W}_{k}^{i}\in \mathbb {R}^{d_{i}\times d}\) are learnable weights (\(d_{i}=d/M\) by default).
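The computation in (12)-(13) amounts to per-head attention maps between \(\textbf{x}_o\) and \(\textbf{E}_Q\) concatenated back onto \(\textbf{E}_Q\); a minimal sketch, with the projection layers and tensor layout assumed for illustration, is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetAttentionBlock(nn.Module):
    """Sketch of TAB: per-head attention maps between the decoded target query
    x_o and the query-frame features E_Q, concatenated with E_Q (eq. (12)-(13))."""
    def __init__(self, d=256, num_heads=8):
        super().__init__()
        self.num_heads, self.d_head = num_heads, d // num_heads
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)

    def forward(self, x_o, e_q):          # x_o: (1, d); e_q: (H*W, d)
        q = self.w_q(x_o).view(self.num_heads, self.d_head)            # (M, d_m)
        k = self.w_k(e_q).view(-1, self.num_heads, self.d_head)        # (HW, M, d_m)
        attn = torch.einsum('md,nmd->mn', q, k) / self.d_head ** 0.5   # (M, HW)
        attn = F.softmax(attn, dim=-1)
        # S = [E_Q, Attn_1, ..., Attn_M]: concatenate along the channel dimension.
        return torch.cat([e_q, attn.transpose(0, 1)], dim=-1)          # (HW, d + M)
```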

Segmentation head The features \(\textbf{S}\) are fed into a segmentation head, which outputs the final mask prediction. Here, we use the refine module from [31, 32] as the building block of our segmentation head. It consists of two blocks, each taking the previous stage's output and the current frame's feature maps from the feature extractor at the corresponding scale through skip connections. Each refine module upscales the compressed feature maps by a factor of two. Then a 2-channel convolution and a softmax operation are attached after the two blocks to obtain the predicted mask at 1/4 scale of the input image. Finally, we use bi-linear interpolation to upscale the predicted mask to the original scale.

Multi-object segmentation Our framework can easily be extended to multi-object segmentation. Specifically, the network first predicts a mask for every target object. Then, a soft aggregation operation is used to merge all the predicted maps. We apply this operation during both training and inference to keep the two stages consistent. For each location l in the predicted mask \(\textbf{M}_{i}\) of object \(i(i<N)\), the probability \(p_{l, i}\) after the soft aggregation operation can be expressed as:

$$\begin{aligned} p_{l, i} = \sigma (\textrm{logit}(\hat{p}_{l, i})) = \frac{\hat{p}_{l, i}/(1 - \hat{p}_{l, i})}{\sum _{j=0}^{N-1}{\hat{p}_{l, j}/(1 - \hat{p}_{l, j})}} \end{aligned}$$
(14)

where N is the number of objects, and \(\sigma \) and logit represent the softmax and logit functions, respectively. The probability of the background is obtained by subtracting the merged foreground probability from 1.
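Equation (14) can be read as normalizing the per-object odds ratios; a small sketch, under the assumption that the per-object foreground probabilities are stacked along the first dimension, is:

```python
import torch

def soft_aggregate(prob_maps, eps=1e-7):
    """Sketch of the soft aggregation in (14).
    prob_maps: (N, H, W) per-object probabilities \\hat{p}_{l, i}."""
    p = prob_maps.clamp(eps, 1.0 - eps)
    odds = p / (1.0 - p)                          # \hat{p} / (1 - \hat{p})
    return odds / odds.sum(dim=0, keepdim=True)   # normalize over the N objects
```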

3.4 Training and inference

Training Our proposed model only requires short training video clips since it makes no temporal smoothness assumptions; nonetheless, it can still learn long-term dependencies. Like most STM-based methods [19, 22, 27, 32, 37], we synthesize video clips by applying data augmentations (random affine, color, flip, resize, and crop) to static images from the datasets [6, 9, 18, 24]. We then use the synthetic videos to pretrain our model. This pre-training procedure helps our model be robust against various object appearances and categories. After that, we train our model on real videos. We randomly select T frames from a video sequence of DAVIS [34, 35] or YouTube-VOS [49] and apply data augmentation on those frames to form a training video clip. By doing so, we can expect our model to learn long-range spatial-temporal information. We use the sum of a cross-entropy loss \(\mathcal {L}_{cls}\) and a mask IoU loss \(\mathcal {L}_{IoU}\) as the multi-object training loss \(\mathcal {L}\), which can be expressed as:

$$\begin{aligned} \mathcal {L} = \frac{1}{N} \sum _{i=0}^{N-1}[\mathcal {L}_{cls}(\textbf{M}_{i}, \textbf{Y}_{i}) + \mathcal {L}_{IoU}(\textbf{M}_{i}, \textbf{Y}_{i})] \end{aligned}$$
(15)
$$\begin{aligned} \mathcal {L}_{cls}(\textbf{M}_{i}, \textbf{Y}_{i}) = -\frac{1}{|\Omega |}\sum _{p\in \Omega }[\textbf{Y}_{i} \textrm{log}(\frac{\exp (\textbf{M}_{i})}{\sum _{j=0}^{N-1}(\exp (\textbf{M}_{j}))})]_{p} \end{aligned}$$
(16)
$$\begin{aligned} \mathcal {L}_{IoU}(\textbf{M}_{i}, \textbf{Y}_{i}) = 1 - \frac{{\sum _{p \in \Omega } {\min ({\textbf {Y}}^{p}_{i},{\textbf {M}}^{p}_{i})} }}{{\sum _{p \in \Omega } {\max ({\textbf {Y}}^{p}_{i},{\textbf {M}}^{p}_{i})}}} \end{aligned}$$
(17)

where \(\Omega \) denotes the set of all pixels in the object mask, \(\textbf{M}_{i}, \textbf{Y}_{i}\) represent the predicted mask and ground truth of object i, and N is the number of objects. Note that N is set to 1 when segmenting a single object.
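A sketch of the combined loss in (15)-(17), including the bootstrapped cross-entropy mentioned in Section 4.1, might look as follows; the tensor layout (per-object logits and an integer label map) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def vos_loss(logits, targets, top_ratio=0.4, eps=1e-6):
    """Sketch of the training loss (15)-(17): bootstrapped cross-entropy plus
    a soft mask-IoU term.
    logits:  (N, H, W) per-object scores M_i.
    targets: (H, W) integer map of object indices (ground truth Y)."""
    # Cross-entropy with bootstrapping: keep only the hardest 40% of pixels.
    ce = F.cross_entropy(logits.unsqueeze(0), targets.unsqueeze(0),
                         reduction='none').flatten()
    k = max(1, int(top_ratio * ce.numel()))
    ce_loss = ce.topk(k).values.mean()

    # Soft IoU between per-object probabilities and the one-hot ground truth.
    probs = logits.softmax(dim=0)
    onehot = F.one_hot(targets, num_classes=logits.shape[0]).permute(2, 0, 1).float()
    inter = torch.minimum(probs, onehot).sum(dim=(1, 2))
    union = torch.maximum(probs, onehot).sum(dim=(1, 2))
    iou_loss = (1.0 - inter / (union + eps)).mean()
    return ce_loss + iou_loss
```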

Inference Our model uses past frames with their predicted masks to segment the current frame during the online inference phase. To balance inference speed and accuracy, we do not use an external memory to store every past frame's features; instead, we only use the first frame with its ground truth and the previous frame with its predicted mask as the reference set, since the former always provides the most reliable information and the latter is the most similar to the current frame. Note that our model is flexible and can use more reference frames to obtain more historical information for segmenting the current frame.
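The default inference-time reference policy can be summarized by the short sketch below, where `model` is a placeholder for the full network taking a query frame and a list of (frame, mask) reference pairs.

```python
def segment_video(model, frames, first_mask):
    """Sketch of the online inference loop: segment frame t using the first
    frame (ground-truth mask) and the previous frame (predicted mask)."""
    masks = [first_mask]
    for t in range(1, len(frames)):
        reference_set = [(frames[0], masks[0]),          # most reliable reference
                         (frames[t - 1], masks[t - 1])]  # most similar reference
        masks.append(model(frames[t], reference_set))
    return masks
```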

4 Experiments

In this section, we first introduce the implementation details of our approach, the datasets, and the evaluation metrics. Then we perform extensive experiments to demonstrate that our model consistently outperforms or obtains comparable performance with the state-of-the-art methods on the DAVIS [34, 35] and YouTube-VOS [49] benchmarks. We also give some qualitative results to show the effectiveness of our model. Next, we conduct comprehensive ablation studies to analyze the effect of the individual components of our method and some configurations. Finally, we explore how to improve efficiency and strike a balance between segmentation quality and inference speed.

Table 1 Comparison with the state of the art on DAVIS16 and DAVIS17-val. ‘OL’ indicates the use of an online-learning strategy. ‘+YT’ means the use of YouTube-VOS for training. Runtimes of other methods were obtained from the corresponding papers
Table 2 Comparison with the state of the art on the DAVIS17 test-dev set and YouTube-VOS 2018 validation set. ‘OL’ indicates the use of an online-learning strategy. The subscripts of \(\mathcal {J}\) and \(\mathcal {F}\) on YouTube-VOS denote seen objects (s) and unseen objects (u). The overall metric is the average of \(\mathcal {J}_s, \mathcal {J}_u, \mathcal {F}_s, \mathcal {F}_u\)

4.1 Implementation details

We use the first four stages of ResNet50 [10] pretrained on ImageNet [24] and replace its input layer with the proposed dynamic input adapter to form our feature extractor. The number of transformer encoder layers and decoder layers is set to \(L=6\). The multi-head attention layers have \(M = 8\) heads and width \(d = 256\), while the feed-forward networks have 2048 hidden units. A dropout ratio of 0.1 is used. The proposed model is trained with an input resolution of 480p, and the length T of the training video clip is set to 2. Similar to [32], the maximum temporal interval of sampling increases by 5 every 20 training epochs. We freeze all batch normalization layers and minimize our loss using the AdamW optimizer (\(\beta = (0.9, 0.999)\), \(eps = 10^{-8}\), and weight decay \(10^{-4}\)) with an initial learning rate \(lr=10^{-4}\). During training, we adopt a bootstrapping strategy for the cross-entropy loss, where only the top 40% of pixels with the largest training loss are considered. The model is trained with a batch size of 4 for 160 epochs on 4 TITAN RTX GPUs, taking about 1.5 days. Note that our model is flexible with respect to the size of the reference set. In the inference stage, to balance accuracy and efficiency, our model with input resolution 480p only refers to the first and previous frames to segment the current frame. We conduct all inference experiments on a single TITAN RTX GPU.

Fig. 3

Overview of our model. The feature extractor is used to extract the features of the current frame and reference sets. The vision transformer is exploited to model the temporal and spatial relationships. The target attention block (TAB) is used to extract the target mask features. The segmentation head is designed to obtain the predicted object mask. "\(+\)", "\(\mathrm C\)" indicate the adding and concatenating operations, respectively. For a better view, we only illustrate two reference frames

Fig. 4

Qualitative results on the DAVIS2017-val. The groundtruth is visualized in the first row, and the next three rows show comparisons of our method with STM [32] and CFBI [53]. Our model handles object occlusion better due to the strong ability of spatial-temporal modeling

4.2 Datasets and evaluation metrics

We evaluate our approach on the DAVIS [34, 35] and YouTube-VOS [49] benchmarks. We experiment on both DAVIS2016 and DAVIS2017. DAVIS2016 is an annotated single-object dataset containing 30 training and 20 validation video sequences. DAVIS2017 is a multi-object dataset expanded from DAVIS2016, including 60 training, 30 validation, and 30 test video sequences. The YouTube-VOS dataset is a large-scale VOS dataset with 3471 training videos and 474 validation videos, where each video contains at most 12 objects. The validation set includes seen objects from 65 training categories and unseen objects from 26 categories, making it appropriate for evaluating algorithms' generalization performance. We use the evaluation metrics provided by the DAVIS benchmark to evaluate our model: \( J \& F\) evaluates the general quality of the segmentation results, J evaluates the mask IoU, and F estimates the quality of contours.

4.3 Comparison with the state-of-the-art

DAVIS We compare the proposed model with the state-of-the-art methods on the DAVIS benchmark [34, 35]. We also present the results trained with additional data from YouTube-VOS [49]. The evaluation results on DAVIS16-val and DAVIS17-val are reported in Table 1. When adding YouTube-VOS for training, our method achieves state-of-the-art performance on DAVIS17-val (83.9% in \( J \& F\)), outperforming the online-learning methods by a large margin and surpassing matching-based methods such as STM, RMNet, and CFBI. Specifically, our model outperforms the transformer-based SST by 1.4% in \( J \& F\), surpasses JOINT by 0.4% in \( J \& F\), and exceeds AOT-B by 1.8% in \( J \& F\). When only using DAVIS for training, our model achieves better quantitative results than those with the same configuration and even outperforms several methods, such as FEELVOS and AGAME, that use YouTube-VOS for training. On DAVIS16-val, our model has comparable performance with state-of-the-art methods. Compared to KMN, our model has the same \( J \& F\) score, with a higher J score and a slightly lower F score. Since DAVIS 2016 is a single-object dataset, segmentation details such as boundaries play an important role in performance evaluation. We believe that the Hide-and-Seek training strategy, which provides more precise boundaries, helps KMN considerably. We also report the results on the DAVIS17 test-dev in Table 2. Our model outperforms all the online-learning methods. Except for being slightly lower than KMN by 0.3% in \( J \& F\), our model surpasses all the methods in the second part. Note that we only use simple data augmentation when pretraining on static image datasets, while KMN applies more complicated pretraining strategies, which helps improve its performance.

YouTube-VOS Table 2 compares our method with state-of-the-art methods on the YouTube-VOS 2018 validation set [49]. On this benchmark, our method obtains an overall score of 81.8% and outperforms all the methods in the first and second parts, demonstrating that the proposed method is robust and effective. Specifically, our model surpasses STM by 2.4% in the overall score. Note that we only refer to the first and previous frames to segment the current frame, while STM maintains a large memory bank that saves a new memory frame every five frames. Our model also outperforms KMN and CFBI, both by 0.4% in the overall score, and surpasses the most related transformer-based SST. Our model achieves the best F scores but not the best J scores. We suspect that the model pays more attention to spatial relationships to obtain more accurate target boundaries, thereby acquiring higher overall scores on YouTube-VOS 2018.

Fig. 5

Qualitative results on the DAVIS2017 test-dev and YouTube-VOS 2018 validation sets. Compared to STM, our model performs better when segmenting highly similar objects and fast-moving objects

Table 3 Ablation studies of mask utilization and reference sets with input resolution 240p on DAVIS 2017 validation set

Qualitative results The visualization results on DAVIS17-val are shown in Fig. 4, and the qualitative results on DAVIS17 test-dev and YouTube-VOS 2018 validation are shown in Fig. 5. The proposed method handles object occlusion better due to its strong spatial-temporal modeling ability. Our model also performs better when segmenting highly similar objects and fast-moving objects.

Fig. 6

Visualization of attention maps from the transformer encoder/decoder layers. The first row of each sample ("car", "camel", and "twirl girl") shows the attention maps from the fourth transformer encoder layer. We take the center 16x16 patch of the object in the current frame (t-th frame) as the query to get the attention weights. The second row visualizes the cross-attention weights from the fourth transformer decoder layer with only one target query. The decoder layers pay more attention to the target object and can reduce the interference of the background

Table 4 Impacts of different types of feature extractors. Models are tested with the input resolution of 240p on DAVIS17-val

4.4 Ablation study

We conduct all the ablation experiments on the DAVIS17 validation set [35]. Unless specified otherwise, the model used in this section is not pre-trained on synthetic videos, and the input resolution is 240p. Moreover, we test the model with only the first and previous frames referred to by default. Here we present ablation studies on the dynamic feature extractor, mask utilization, reference sets, transformer structure, backbone, training strategy, and input resolution.

Dynamic feature extractor In Table 4, we compare our model equipped with the proposed dynamic feature extractor with two variants: i) the existing approach of using two independent extractors (as in STM [32]; denoted ‘Independent’), and ii) using a siamese architecture and concatenating the object mask with the reference frame features (as in AGAME [14]; denoted ‘Siamese’). The results show that our model employs fewer parameters (about 20% reduction) than i) and obtains higher performance (+7.8% in \( J \& F\) score) than ii).

Mask utilization To demonstrate the effectiveness of our dynamic feature extractor, we implement three typical ways to utilize the predicted masks of past frames: (1) the predicted masks are multiplied with the encoded features of the RGB frame, denoted as ‘multiply’; (2) the encoded features of the RGB frame and the predicted mask are multiplied first and then added to the former, denoted as ‘residual’; (3) the predicted masks and the RGB frame are fed into the dynamic input adapter, denoted as ‘adapter’. As shown in Table 3, compared to directly multiplying the predicted mask with the encoded features (line 1) and fusing the mask with the residual structure (line 2), our dynamic input adapter gains improvements of \( 15.1\%(J \& F)\) and \( 8.0\%(J \& F)\), respectively.

Reference sets We test how reference sets affect the performance of our proposed model. We experiment with four reference-set configurations: (1) only the first frame with the ground-truth masks; (2) only the previous frame with its predicted mask; (3) both the first and previous frames with their masks; (4) the reference set is dynamically updated by appending new frames with their predicted masks every five frames. As Table 3 shows, our model achieves superior performance even with only two frames referred to. Interestingly, we find that updating the memory every five frames as in STM [32] may not be beneficial for all methods, because low-quality segmentation results of historical frames can mislead subsequent mask predictions.

Table 5 Ablation studies of different components with input resolution 240p on DAVIS 2017 validation set. ‘TD’ denotes the transformer decoder
Table 6 Ablation studies of different backbone with input resolution 240p on DAVIS 2017 validation set

Transformer structure We visualize attention maps from the transformer encoder/decoder layers in Fig. 6. The attention maps from the fourth encoder layer show that the encoder focuses on the target object but is still disturbed by the background. In contrast, the transformer decoder can eliminate the background influence and pay more attention to the target object. We also conduct quantitative experiments in Table 5 to explore the effectiveness and necessity of the transformer decoder. Equipped with the transformer decoder, our model obtains a \( 1.2\%(J \& F)\) improvement over removing it. Therefore, employing the transformer's decoder is essential.

Backbone We experiment with different backbones: ResNet18, ResNet50 [10], and Swin Transformer [26]. As shown in Table 6, our model with the smaller ResNet18 backbone runs faster (a 7 fps improvement) than with ResNet50, while the performance drops by 4.1% (\( J \& F\)). The model with Swin-small as the backbone gains 0.5% (\( J \& F\)) over ResNet50 but contains more parameters and runs slower (a 5 fps drop). Therefore, we take ResNet50 as the backbone to achieve a more balanced performance.

Training strategy We conduct experiments to explore the effectiveness of pretraining on synthetic videos. As Table 7 shows, our model only drops by 1.5% (\( J \& F\)) without pretraining, which means our proposed approach can learn a general and robust target object appearance even when trained on a small dataset.

Table 7 Training data analysis on the DAVIS 2017 validation set. We conduct ablation studies to explore how pre-training affects our model's performance
Table 8 Input resolution analysis. We compare models with different input resolutions on the DAVIS 2017 validation set

Input resolution We adjust the input resolution of the model as shown in Table 8, from which we can see that our method achieves better performance with a larger input size. Our method with half the input resolution runs faster (a 10.4 fps improvement), while the performance drops by 4.0% (\( J \& F\)). Therefore, we compare our model with input resolution 480p to other state-of-the-art methods.

4.5 Exploration of efficiency improvement

This section discusses how to strike a balance between segmentation accuracy and inference speed. As shown in Tables 6 and 8, our model can trade accuracy for latency by switching to a lighter backbone or by halving the input resolution. Besides, we observed that the transformer causes a large portion of the latency, so another way to improve efficiency is to reduce the computation of the transformer. First, we provide a slim version with only half the number of the original model's transformer encoder/decoder layers. The slim version achieves 11.1 fps with 82.8% (\( J \& F\)) on DAVIS17-val, as shown in Table 9.

Moreover, since the computational complexity of the self-attention mechanism is proportional to the square of the input sequence length, we further propose to compress the redundant reference representations for efficiency improvement. Inspired by [21, 42], we modify the \(\textrm{PAM}\) module of SwiftNet [42] into a sequence organizer that compresses the redundant elements in reference sequences. Note that the \(\textrm{PAM}\) module is originally used to compress redundant pixels of the key and value embeddings in the memory bank. We briefly describe the compression process here based on SwiftNet. In the online inference phase, if frame \(\textbf{I}_{t}\) with the predicted masks \(\textbf{M}_{t}\) is chosen as a reference frame, then for each element in the representations \(\textbf{X}_{t}\) of \((\textbf{I}_{t}, \textbf{M}_{t})\), the sequence organizer finds its most relevant element in the representations \(\textbf{X}_{R, t-1}\) of the reference set at timestamp \(t\mathrm {-1}\) via dot-product and computes the cosine similarity as the feature score. The elements in \(\textbf{X}_{t}\) are then sorted by their feature scores. Finally, the top \(\beta \) (experimentally set to 10%) percent of elements of \(\textbf{X}_{t}\) are selected and added to the reference set. From Table 9, we can see that the sequence organizer improves the inference speed by over 50% with a slight accuracy drop (0.2% in \( J \& F \)).
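Read literally, the compression step scores each new reference token by cosine similarity and keeps a fixed fraction of them; the sketch below follows that description, with the token layout and helper name being our own assumptions rather than the exact \(\textrm{PAM}\)-based implementation.

```python
import torch
import torch.nn.functional as F

def compress_reference(x_t, x_ref, beta=0.1):
    """Sketch of the sequence-organizer compression: score each new reference
    token by the cosine similarity to its most relevant token in the existing
    reference representation, then keep the top-beta fraction.
    x_t:   (N_new, d) tokens of the new reference frame (I_t, M_t).
    x_ref: (N_ref, d) tokens of the reference set at timestamp t-1."""
    sim = F.normalize(x_t, dim=-1) @ F.normalize(x_ref, dim=-1).t()  # cosine sims
    scores = sim.max(dim=-1).values          # most relevant existing token per row
    k = max(1, int(beta * x_t.shape[0]))
    _, keep = scores.topk(k)                 # indices of the selected tokens
    return x_t[keep]                         # tokens appended to the reference set
```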

Table 9 Exploration of efficiency improvement. Models are tested on DAVIS17-val

5 Conclusions

This paper proposes a new framework for video object segmentation (VOS): a compact and unified single-extractor pipeline with strong spatial and temporal interaction ability based on a vision transformer. Specifically, we propose a dynamic feature extractor to encode the reference sets and query frames in a unified way, dramatically slimming the existing VOS framework while maintaining its performance and architectural flexibility. Moreover, we attach the vision transformer to the dynamic feature extractor to model the spatial and temporal relationships among reference sets and query frames simultaneously, providing discriminative spatial-temporal features for segmentation. With the extractor, transformer, and segmentation head, we implement an effective and modularized framework. Our model achieves top performance on several benchmarks, demonstrating its potential and effectiveness. We also explore how to strike a balance between segmentation quality and inference speed. In the future, we will further improve our model's efficiency by designing a better sequence organizer and applying it after each transformer encoder layer.