1 Introduction

Object segmentation can be considered a labeling problem that aims to separate foreground from background regions. Video instance segmentation, which is higher-level and more challenging than object segmentation, aims to assign each pixel of every video frame to an instance or to the background and then keep the instance IDs consistent over the video sequence. Object/instance segmentation in videos benefits a wide range of practical applications, e.g., autonomous vehicles [1], action recognition [21], video summarization [30], object tracking [66], scene understanding [70], and video annotation [28].

Fig. 1 Examples of results obtained by our proposed method. From left to right: the first video frame with the ground-truth label followed by results of our method on next frames

This paper focuses on semi-supervised video instance segmentation [46], which targets specific instances whose ground-truth masks are given in the first video frame. The DAVIS Challenge [46] promotes the development of this task. The benchmark dataset of this challenge presents many challenges, such as rapid motion, distractors, small objects, fine structures, occlusions, large deformations, and complex object interactions. Figure 1 shows some exemplary results of our proposed method on the DAVIS Test-Challenge dataset [46].

To address the challenges of the given problem, tracking and re-identification methods are adopted and jointly integrated into segmentation models to keep targeted instances consistent over the entire video sequence [17, 22, 31, 32]. However, existing works often fail to track and segment the targeted instances because they cannot cover the various contexts occurring in a video. We argue that context information is essential for semantic segmentation to reduce ambiguity between instances and obtain robust results. Therefore, this work aims to leverage context information to improve the performance of video instance segmentation. Inspired by the idea of “you should look twice” [42, 43] in object detection, we propose a three-pass guided segmentation framework, namely contextual guided segmentation (CGS), to tackle the problem of semi-supervised video instance segmentation. Our proposed method consists of two key ideas, described below.

First, we exploit variations in the video and propose contextual segmentation strategies that adapt to the context, i.e., the category and visual properties of each instance. To select the appropriate scheme, we propose a novel Instance Re-Identification Flow (IRIF) to propagate the initial mask of an instance to other frames and analyze the visual properties of the segmented regions. Multiple contextual segmentation schemes are then introduced to adapt to the contextual properties of each instance. For human instances, we develop skeleton-guided segmentation. For nearly rigid non-human instances, we train FCNs on our synthesized dataset with similar background scenes. Instance segmentation detectors are utilized to handle deformable non-human instances of known categories. Results from our IRIF are treated as the baseline scheme for the remaining cases.

Second, to segment an instance in a region of interest (ROI), we propose a novel attention-based guided fine-grained segmentation for performance improvement. We transform a regular rectangular ROI into a non-rectangular ROI by blending attention inferred from neighboring frames to eliminate the complex background inside the ROI. We also propose bidirectional propagation strategies to construct adaptive attention for guided segmentation. The forward propagation strategy can correct excess segmentation caused by dense objects in an ROI. Meanwhile, the backward propagation strategy can recover missing instances due to fast motion, occlusion, or heavy deformation.

The DAVIS Challenges 2017–2019 results indicate that our method is competitive among the top-performing submissions. Our early results were preliminarily reported in the DAVIS 2017 Challenge [25], the DAVIS 2018 Challenge [59], and the DAVIS 2019 Challenge [58]. In this paper, we provide the full details of our proposed framework. Our contributions are as follows.

  • We propose the contextual guided segmentation (CGS) framework with three segmentation passes to exploit various contexts in video instance segmentation. Our proposed method achieved top rankings in the DAVIS Challenges 2017–2019 (3rd, 6th, and 3rd place, respectively).

  • We propose Instance Re-Identification Flow (IRIF) to extract the contextual properties of each instance by propagating its mask from the current frame to the coming frames.

  • We introduce multiple contextual segmentation schemes that adapt to the contextual properties of each instance.

  • We propose bidirectional propagation strategies for guided fine-grained segmentation in non-rectangular ROIs. Our proposed guided segmentation outperforms standard segmentation, which is typically applied to rectangular ROIs.

  • To blend instance masks into a unique result, we introduce a merging process based on depth values together with human/non-human instance interaction and rare instance priority.

  • We construct Wonderland Data to increase the amount of training data for one-shot learning. Our proposed augmentation approach can also be applied to other problems.

The remainder of this paper is organized as follows. In Sect. 2, we briefly review the related work. Next, our proposed methods are presented in Sect. 3. Experimental results are then reported and discussed in Sect. 4. Finally, Sect. 5 concludes and paves the way for future work.

2 Related work

2.1 One-shot learning

Data augmentation is essential for one-shot learning [2], which aims to train a deep network with only the given first video frame. Caelles et al. [2] introduced a first, simple data augmentation strategy consisting of random cropping, random scaling, vertical flipping, and random changes in brightness, saturation, and contrast of the given first frame. Khoreva et al. [22] later introduced Lucid Dreaming [22], which synthesizes foreground changes via rigid and non-rigid transformations of small extent and synthesizes background changes using affine deformations with limited appearance variations. The given first frame with ground truth is augmented with Lucid Dreaming to generate more training data with different viewpoints, leading to a considerable improvement when training networks. Hence, data augmented by Lucid Dreaming, called Lucid Data, has become common for one-shot learning. However, Lucid Data cannot deal with background changes caused by object motion or camera view changes. Guo et al. [17] replaced the background of the first video frame with pure-background images crawled randomly from the Internet via Google, namely Online Data. However, Online Data is unstable because the images are crawled randomly from the Internet without considering the content of the video. Meanwhile, our Wonderland Data is filtered from large-scale scene data to select the scenes most similar to the video.

Khoreva et al. [22] trained appearance-based and motion-based models on Lucid Data [22]. Shaban et al. [54] learned video segments by bootstrapping them from temporally consistent object proposals, which are first spatially trained on Lucid Data [22] and then combined with a semi-Markov pixel-level motion model to form spatiotemporal object proposals. Luiten et al. [38] first trained DeepLabv3+ [8] on a combination of standard datasets and then fine-tuned the network on the Lucid Data [22] of each video to form a strong network for segmenting instances inside ROIs. Li et al. [31] trained an online re-identification network, namely the original Region Proposal Network of Mask R-CNN, and a recurrent mask propagation network on Lucid Data [22]. Xu [71] proposed a spatiotemporal CNN in which the spatial segmentation branch is fine-tuned online on the Lucid Data of each sequence, while the temporal coherence branch is trained offline on the entire dataset. Models are not only fine-tuned offline on the Lucid Data [22] of the first frame but can also be updated online while processing the video [62]. Mask R-CNN is fine-tuned on Lucid Data [74] or Online Data [17] to adapt proposals to the video.

Fig. 2 Overview of our contextual guided segmentation (CGS) framework

2.2 Temporal connection mining

This approach aims to perform instance tracking, propagation, and re-identification, where each instance is detected and re-identified across frames [32]. Li et al. [32] iteratively propagated masks via flow warping and re-identified instances via adaptive matching to retrieve missing ones. Luiten et al. [38] first segmented multiple object proposals in the entire video and then selected and linked these proposals over time using a re-identification feature embedding vector for each proposal. The re-identification feature embedding vectors are computed using a triplet-loss-based re-identification embedding network. Li et al. [31] combined re-identification and attention-based recurrent temporal propagation into a unified framework to retrieve missing objects despite their large appearance changes. Guo et al. [17] first extracted possible mask proposals in each frame and then combined tracking and re-identification to filter and rank proposals and merge the most confident ones. Xu et al. [74] adapted a multiple hypotheses tracking method to build a bounding box proposal tracking tree for different objects, then propagated masks, and finally merged mask proposals from the tracking tree. Wang et al. [66] used fully convolutional Siamese trackers to produce class-agnostic binary segmentation masks of the target objects. Voigtlaender et al. [61] used a semantic pixel-wise embedding together with a global and a local matching mechanism to transfer information from the first frame and from the previous frame of the video to the current frame, which is used as internal guidance for segmentation. Jonathon et al. [39] used a Siamese architecture to detect and track multiple objects and then performed segmentation inside the detected bounding boxes. Tran et al. [57] propagated masks with reference to multiple extra samples through a memory reference pool.

2.3 End-to-end temporal learning

This approach directly learns temporal information in a video through deep learning architectures such as LSTMs, guided attention, or memory networks. Some methods combine feature maps from different video frames by correlation matching [61] or non-local matching [44]. Guo et al. [16] integrated STM [44] into DeepLabv3+ [8] to concatenate low-level features in the mask decoder. Andreas et al. [48] implemented a memory network to add semantic information about the target object from a previous frame to the refinement stage, complementing the predictions provided by the target appearance model. Zhang et al. [78] developed a spatial constraint module that takes the previous prediction to generate a spatial prior for the current frame, helping to disambiguate appearance confusion and eliminate false predictions. Fiaz et al. [15] introduced a guided feature learning algorithm without model update for directional deep appearance learning. Liu et al. [35] integrated a multi-level backbone into a memory network to generate features of higher spatial resolution. Le et al. [64] leveraged existing memory-based models and enhanced their capability by adding pre-processing and post-processing steps. Xie et al. [69] integrated depth maps from a video sequence into STM [44] to alleviate the ambiguity of objects with similar appearances. Seong et al. [53] developed a kernelized memory network and used the Hide-and-Seek training strategy to handle occlusions and segment boundary extraction. Yang et al. [77] combined collaborative foreground–background integration with multi-scale matching to be robust to various object scales.

3 Proposed method

3.1 Overview

Figure 2 illustrates CGS with three passes: preview segmentation for context evaluation, contextual segmentation, and guided segmentation based on propagation. In particular, in the first pass, we propose Instance Re-Identification Flow (IRIF) to generate the preview mask sequence and extract different contextual properties of each instance. In the second pass, we introduce multiple segmentation schemes corresponding to the extracted properties. In the third pass, we develop fine-grained segmentation based on guided propagation. We remark that each instance is processed independently over the frames of a video sequence. Finally, instance masks are blended with reference to depth information, human/non-human instance interaction, and rare instance priority.

Fig. 3 The flowchart of the Instance Re-Identification Flow (IRIF) component. The segmentation performed on the current frame is based on the history information of the previous frames. The segmentation result of the current frame is further fed to the processing of the coming frame

3.2 Preview segmentation

Figure 3 illustrates the flowchart of Instance Re-Identification Flow (IRIF) for preview segmentation. The segmentation performed on the current frame is based on the history information of the previous frames. The segmentation result of the current frame is further fed to the processing of the coming frame.

We remark that in this component, we consider two types of instances, i.e., human and non-human, and treat each instance accordingly. Given the first frame with its ground-truth label, we extract the bounding box for each instance and then perform human/non-human classification for all instances using Mask R-CNN [18].

3.2.1 Instance localization and tracking

For each video frame, we localize and track instances in a re-identification manner. Note that we expand the bounding box by 10% to fully capture the whole area of each object instance. For human instances, we employ person search [68], detecting persons with Faster R-CNN and then extracting a person re-identification feature for every detected person region. On the other hand, DeepFlow [67] and Deformable Part Models (DPM) [14] are utilized to detect and track non-human objects.

3.2.2 Adaptive online learning for instance segmentation

For each instance, to identify each pixel as foreground (instance) or background, we utilize multiple binary SVM classifiers [6], which are learned from the appearance of the previous n frames with sampling step size \(\delta \), where n and \(\delta \) are set to 8 and 2, respectively. Note that our multiple binary SVM classifiers are implemented for history reference with several unary cues, e.g., saliency [36], CNN features [23], location of the bounding box, and color, to segment each instance within its tracked bounding box in each frame. We only update the SVM model of an instance if its size changes significantly. We then utilize GrabCut [49] for each instance to separate it from the background. After this step, each pixel is assigned an instance ID.
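As an illustration of this step, the following minimal sketch trains a per-instance binary SVM on pixels sampled from the recent history and refines the prediction inside the tracked box with GrabCut. It simplifies our multi-classifier setup to a single linear SVM, and extract_pixel_features is a hypothetical helper standing in for the per-pixel cues (color, saliency, CNN features) mentioned above.

    import numpy as np
    import cv2
    from sklearn.svm import LinearSVC

    def update_instance_model(history_frames, history_masks, n_history=8, step=2):
        """Fit a binary foreground/background SVM on pixels from the last n_history frames (step-sampled)."""
        feats, labels = [], []
        for frame, mask in list(zip(history_frames, history_masks))[-n_history::step]:
            f = extract_pixel_features(frame)   # placeholder: HxWxD per-pixel cues (color, saliency, ...)
            feats.append(f.reshape(-1, f.shape[-1]))
            labels.append((mask > 0).reshape(-1).astype(np.uint8))
        clf = LinearSVC(C=1.0)
        clf.fit(np.concatenate(feats), np.concatenate(labels))
        return clf

    def segment_in_box(frame, box, clf):
        """Classify pixels inside the tracked box as foreground, then refine with GrabCut."""
        x, y, w, h = box
        roi = np.ascontiguousarray(frame[y:y + h, x:x + w])
        f = extract_pixel_features(roi)
        score = clf.decision_function(f.reshape(-1, f.shape[-1])).reshape(h, w)
        gc_mask = np.where(score > 0, cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
        bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
        cv2.grabCut(roi, gc_mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
        return np.isin(gc_mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)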

Specifically for human instances, in case an instance disappears and re-appears within the next couple of frames, we adopt a state-of-the-art image parser, the Pyramid Scene Parsing Network (PSPNet) [81], with the model pre-trained on the PASCAL VOC dataset [13]. The re-identification results from PSPNet are blended into our segmentation outcomes.

3.2.3 Contextual property extraction

This component aims to determine the context of an instance so that we can apply an appropriate segmentation scheme to that instance. The context can be any observable property that may affect the strategy for efficiently extracting the mask of an instance across frames. In this work, we consider the following three attributes of an instance as its context: human or non-human, known or unknown category, and rigid or deformable.

The category of an instance, such as person, car, or dog, can be directly inferred from its initial mask using Mask R-CNN [18] pre-trained on the MS-COCO dataset.

To evaluate whether an instance is rigid or deformable, we analyze the preview sequence of instance masks in the first \(n_{Preview}\) frames. If, for most of the first \(n_{Preview}\) frames, there exists a homography matrix transforming the instance from the first frame to that frame, we consider the instance rigid.
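A possible implementation of this rigidity test is sketched below: ORB keypoints inside the instance mask are matched between the first frame and each preview frame, a homography is fitted with RANSAC, and the instance is declared rigid if the fit succeeds with a high inlier ratio in most frames. The feature choice and the thresholds (0.8 inlier ratio, half of the checked frames) are illustrative assumptions, not values reported here.

    import cv2
    import numpy as np

    def is_rigid(frames, masks, n_preview=20, min_inlier_ratio=0.8, min_vote_ratio=0.5):
        """Decide rigidity: a homography should explain the instance motion in most preview frames."""
        orb = cv2.ORB_create()
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        gray0 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
        kp0, des0 = orb.detectAndCompute(gray0, (masks[0] > 0).astype(np.uint8) * 255)
        votes, checked = 0, 0
        for frame, mask in zip(frames[1:n_preview], masks[1:n_preview]):
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            kp, des = orb.detectAndCompute(gray, (mask > 0).astype(np.uint8) * 255)
            if des0 is None or des is None:
                continue
            matches = matcher.match(des0, des)
            checked += 1
            if len(matches) < 4:
                continue
            src = np.float32([kp0[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
            dst = np.float32([kp[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
            H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
            if H is not None and inliers.mean() >= min_inlier_ratio:
                votes += 1
        return checked > 0 and votes >= min_vote_ratio * checked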

3.3 Contextual segmentation

In this contextual segmentation pass, each instance is segmented with an appropriate scheme adapted to its extracted contextual properties (i.e., human/non-human, rigid/deformable, known/unknown category).

3.3.1 Human instance segmentation

We employ Mask R-CNN [18], pre-trained on the MS-COCO dataset [34], to extract human segments. However, the results of Mask R-CNN may be affected by occlusion or unusual human pose.

To overcome this issue, we develop skeleton-guided segmentation. We use the skeletons from OpenPose [4] as a reference to control and refine human instance segmentation. For a human instance with an unusual pose that Mask R-CNN cannot recognize, we dilate the skeleton to obtain a skeleton-guided region, i.e., an image containing only the region covering the complete human instance. We then apply Mask R-CNN to this skeleton-guided region. By eliminating unrelated content, Mask R-CNN has a higher chance of extracting the human instance segment correctly (see Fig. 4). To preserve inter-frame mask consistency, we use object flow [60] to correct and refine the result across frames.
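The sketch below illustrates the skeleton-guided region construction under stated assumptions: the OpenPose skeleton is rasterized and dilated, unrelated content outside the dilated region is blanked out, and Mask R-CNN is re-run on the guided image. run_openpose and run_mask_rcnn are hypothetical wrappers around the respective models, and the dilation radius is an illustrative value.

    import cv2
    import numpy as np

    def skeleton_guided_segment(frame, run_openpose, run_mask_rcnn, dilate_px=40):
        """Segment a human instance inside a dilated skeleton-guided region."""
        canvas = np.zeros(frame.shape[:2], np.uint8)
        for (xa, ya), (xb, yb) in run_openpose(frame):       # skeleton limbs as point pairs (assumed format)
            cv2.line(canvas, (int(xa), int(ya)), (int(xb), int(yb)), 255, thickness=3)
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (dilate_px, dilate_px))
        region = cv2.dilate(canvas, kernel) > 0
        guided = frame.copy()
        guided[~region] = 0                                  # keep only the skeleton-guided region
        return run_mask_rcnn(guided, category='person')      # re-run segmentation on the guided image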

Fig. 4 Skeleton-guided segmentation for unusual pose

3.3.2 Rigid non-human instance segmentation

For this type of instance, our objective is to accurately extract such instances from different backgrounds in the same scene category as the initial frame. Our method processes each instance as follows. First, we synthesize images from the first frame of the video sequence, resulting in Wonderland Data. Second, to segment instances inside bounding boxes, we train DeepLab2 [7] and OSVOS [2] on our synthesized Wonderland Data.

Wonderland Data Generation Unlike existing work, we exploit various contextual properties of instances, and multiple segmentation schemes are then performed for each instance according to its extracted properties. Inspired by Lucid Data [22], we introduce newly augmented data, namely Wonderland Data. To generate visual variations of the initial mask, we apply both affine and non-rigid deformations, together with illumination changes, to the mask. We also replace the background with the most similar scenes filtered from the large-scale Places365 dataset [82] to preserve the semantics of the image. In this way, we can generate more training samples than Lucid Data (10,000 images for each video, compared with 2,500 images for Lucid Data) to deal with one-shot learning.

Fig. 5 Wonderland data generation

Fig. 6 Augmented data generated by different methods. From left to right: the original video frames with overlaid ground truth, followed by corresponding Lucid Data [22] and our proposed Wonderland Data in this order

Figure 5 illustrates our proposed Wonderland Data generation. In this work, from a pair of an input image and a mask, we generate 10,000 different pairs of synthesized images and masks. The Wonderland Data is published on our website (Footnote 1). We collect scene photos from the training set of the Places365 dataset [82], which has about 8 million images divided into 365 scene categories. We manually discard artificial scenes and use only 22 natural scene categories with 592k images. For each image, we extract a feature at the last layer of DenseNet-161 [20], pre-trained on the Places365 dataset [82]. This feature is used to build a hierarchical k-means search tree for each category independently. Suppose a node contains M images and a leaf node holds at most L images. To cluster the images at a node, we use the k-means algorithm with \(K=\min (M \backslash L, T)\) clusters. In this work, we empirically set \(L=200\) and \(T=200\) to speed up clustering.
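A minimal sketch of this hierarchical k-means construction is given below, assuming the DenseNet-161 features have already been extracted into a feature matrix; indices holds the image indices handled by the current node, and the cluster count follows the integer-division reading of \(K=\min (M \backslash L, T)\).

    import numpy as np
    from sklearn.cluster import KMeans

    def build_tree(features, indices, leaf_size=200, max_k=200):
        """Recursively cluster scene features into a hierarchical k-means search tree."""
        if len(indices) <= leaf_size:
            return {'leaf': True, 'indices': indices}
        k = max(2, min(len(indices) // leaf_size, max_k))
        km = KMeans(n_clusters=k, n_init=4).fit(features[indices])
        children = []
        for c in range(k):
            children.append(build_tree(features, indices[km.labels_ == c], leaf_size, max_k))
        return {'leaf': False, 'centers': km.cluster_centers_, 'children': children}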

We classify an input image into the corresponding category, using DenseNet-161 pre-trained on the Places365 challenge dataset. We also extract a feature at the last layer of the same network. After that, we search for leaf nodes by comparing the Euclidean distance between the feature of the input image and the cluster centers. To retrieve N images, we randomly choose \(80\%\) of the images of the nearest leaf node and \(70\%, 60\%, 50\%\), etc., of the images of the next leaf nodes, respectively.
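Following on from the tree above, this sketch ranks leaf nodes by the Euclidean distance between the query feature and their parent cluster centers (a simplified flat scan rather than a guided descent) and then samples decreasing fractions (80%, 70%, 60%, ...) of the images from the closest leaves, as described above; the stopping rule is an assumption.

    def collect_leaves(node, query, dist=0.0):
        """Return (distance, leaf) pairs, attaching each leaf's parent-center distance to the query."""
        if node['leaf']:
            return [(dist, node)]
        out = []
        for center, child in zip(node['centers'], node['children']):
            out += collect_leaves(child, query, np.linalg.norm(query - center))
        return out

    def search_backgrounds(tree, query, n_images):
        """Sample 80%, 70%, 60%, ... of the images from the nearest leaves until n_images are found."""
        leaves = sorted(collect_leaves(tree, query), key=lambda p: p[0])
        picked, ratio = [], 0.8
        for _, leaf in leaves:
            take = int(ratio * len(leaf['indices']))
            picked += list(np.random.choice(leaf['indices'], size=take, replace=False))
            ratio = max(ratio - 0.1, 0.1)
            if len(picked) >= n_images:
                break
        return picked[:n_images]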

We also extract the object mask from the input image, then transform the object and the retrieved scenes independently, similarly to [22]. In more detail, we use affine transformations (e.g., translation, rotation, and scaling) and non-rigid deformations, together with illumination changes. Figure 6 shows examples of Lucid Data and our Wonderland Data.
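A simplified sketch of composing one augmented pair is given below: the object is warped with a random affine transform, a global gain models the illumination change, and the warped object is pasted onto a retrieved background. Non-rigid deformation is omitted for brevity, and the parameter ranges are illustrative assumptions rather than the values used to build Wonderland Data.

    import cv2
    import numpy as np

    def synthesize_pair(image, mask, background):
        """Compose one augmented (image, mask) pair by transforming the object onto a new scene."""
        h, w = background.shape[:2]               # assumes the first frame is resized to the background size
        angle = np.random.uniform(-15, 15)
        scale = np.random.uniform(0.8, 1.2)
        tx, ty = np.random.uniform(-0.1, 0.1, 2) * (w, h)
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
        M[:, 2] += (tx, ty)
        warped_img = cv2.warpAffine(image, M, (w, h))
        warped_msk = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
        gain = np.random.uniform(0.7, 1.3)        # simple global illumination change
        warped_img = np.clip(warped_img.astype(np.float32) * gain, 0, 255).astype(np.uint8)
        out = background.copy()
        out[warped_msk > 0] = warped_img[warped_msk > 0]
        return out, warped_msk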

Fig. 7 The flowchart of our network training process

Fig. 8 Visualization of guided non-rectangular ROI

Fig. 9 Visualization of forward and backward propagation

Network Training Figure 7 shows our training process, which includes domain-based training and object-based training. In domain-based training, we fine-tune pre-trained networks (i.e., DeepLab2 [7] pre-trained on the COCO-Stuff dataset [3] and OSVOS [2] pre-trained on the ImageNet dataset [50]) on the DAVIS training data for domain transformation. In object-based training, we fine-tune the networks on the ground-truth mask of each instance of each video. We remark that we use only the first frame of each video and apply the proposed Wonderland Data generation method to these images.

3.3.3 Deformable non-human instance segmentation

For this instance type, we categorize instances into two groups, namely known and unknown categories. For the known categories, i.e., those already listed in the MS-COCO dataset [34], we simply adopt Mask R-CNN to retrieve the instance segments. For the unknown categories, we directly use the preview results from our IRIF component since it can handle arbitrary object categories.

3.4 Guided segmentation

Traditional Fully Convolutional Networks (FCNs) consider the entire rectangular region of interest (ROI) as the input to segment objects inside the ROI. This can lead to incorrect boundary segmentation due to the complex background and the concave hull of the object. To overcome this limitation, we aim to transform a rectangular ROI into a non-rectangular ROI following the object boundary to eliminate the complex background inside the ROI (see Fig. 8). In particular, we utilize reference information from additional frames to identify the shape of the instance of interest inside the ROI of the current frame. We propose to apply guided attention to construct the non-rectangular ROI and then perform fine-grained segmentation on this guided non-rectangular ROI.

3.4.1 Bidirectional propagation

We propose bidirectional strategies to construct adaptive attention for guided segmentation: initial segments from neighboring frames are used as references for segmentation at the current frame. Attention is computed via two strategies applied sequentially, i.e., forward propagation and backward propagation, in ways adapted to the context. The forward propagation strategy, where attention is referenced from initial segments of previous frames, can correct excess segmentation due to dense objects in an ROI (cf. Fig. 9a). Meanwhile, the backward propagation strategy, where attention is referenced from initial segments of the next frames, can recover missing instances due to fast motion, occlusion, or heavy deformation (size changing from tiny to large or vice versa) (cf. Fig. 9b).

3.4.2 Guided non-rectangular ROI construction

To construct a guided non-rectangular ROI, we expand the mask of the instance of interest at neighboring frames and then transfer and combine them in the current frame. This guarantees that the ROI covers the entire instance of interest. We do not apply mask propagation, to avoid inaccurate flow warping and to reduce computational complexity. Then, we create a smooth transition region (by applying a blurred mask to remove the background) for the guided ROI to avoid a sharp border between the ROI and the background. This is essential to make the segmentation method focus on the instance of interest and to avoid inaccurate segmentation caused by a sharp border. We remark that the range of boundary expansion and transition smoothing is computed based on the intensity of the instance's motion. Both propagation strategies are performed adaptively if the initial segments of the instance of interest at the current frame differ considerably (in appearance or size) from those at neighboring frames or if the instance re-appears. Otherwise, we only refine the instance of interest at the current frame to save computational cost.
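A minimal sketch of this construction, under stated assumptions: initial masks from neighboring frames are united and dilated, and a Gaussian-blurred alpha map creates the smooth transition before blending with the frame. Here the expansion and feathering radii are fixed arguments, whereas in our method they depend on the intensity of the instance's motion.

    import cv2
    import numpy as np

    def guided_roi(frame, neighbor_masks, expand_px=25, feather_px=15):
        """Blend the frame with an attention map built from expanded neighbor-frame masks."""
        union = np.zeros(frame.shape[:2], np.uint8)
        for m in neighbor_masks:                  # initial masks from forward/backward references
            union = np.maximum(union, (m > 0).astype(np.uint8))
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (expand_px, expand_px))
        expanded = cv2.dilate(union, kernel)
        alpha = cv2.GaussianBlur(expanded.astype(np.float32),
                                 (2 * feather_px + 1, 2 * feather_px + 1), 0)
        guided = (frame.astype(np.float32) * alpha[..., None]).astype(np.uint8)
        return guided, expanded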

3.4.3 Fine-grained segmentation

We use Deep GrabCut [72] and Mask R-CNN [18] for fine-grained segmentation in guided non-rectangular ROIs. Inspired by Luiten et al. [38], we train DeepLabv3+ [8] with an Xception-65 [10] backbone on the MS-COCO [34] and Mapillary [40] datasets to enhance the network generalization. For Mask R-CNN, we directly use a model pre-trained on the MS-COCO [34] dataset.

3.5 Refinement and merging

Through preliminary results, we observe that the initial segmentation is not smooth enough. Therefore, we refine instance masks to improve segmentation quality, using rare instance attention and boundary snapping.

3.5.1 Rare instance attention refinement

We further refine the results by considering rare instances. We observe that rare objects tend to be shrunk by larger objects. To identify rare object instances, we compute the area percentage of each object instance mask (provided in the first frame). Instances whose size is smaller than 5% of the total size of the tracked objects are considered rare. We assume that an object that is small in the first frame tends to remain small throughout the video. Next, we recover rare object instances by transferring the results produced by the foreground probability obtained from the binary SVM classifier of each object instance.
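The rare-instance test amounts to a simple area ratio over the first-frame masks, as sketched below (the 5% threshold follows the description above).

    import numpy as np

    def find_rare_instances(first_frame_masks, ratio=0.05):
        """Mark instances whose first-frame area is below 5% of the total tracked area as rare."""
        areas = {iid: int((m > 0).sum()) for iid, m in first_frame_masks.items()}
        total = sum(areas.values())
        return {iid for iid, a in areas.items() if a < ratio * total}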

3.5.2 Boundary snapping refinement

We also adopt boundary snapping [2] to further refine object shapes. In particular, we extract the saliency [36] and the contour [76] from the video frame. The salient pixels close to the contour are snapped.

3.5.3 Topological order estimation for instance merging

It is essential to determine the topological relationship (in terms of z-order) between multiple instances so that the corresponding masks of different instances can be sequentially combined into the final result. We merge instances based on human/non-human instance interaction, depth values, and rare instance priority heuristics, in this order, as follows (a merging sketch is given after the list):

  • Human and non-human instance interaction We define interaction heuristics as follows: transportation instances (such as horses, bikes, motorbikes, surfboards, and skateboards) are the farthest from the camera; human instances are at a middle distance to the camera; and small non-human instances that can be held, carried, touched, etc. are the nearest to the camera. Interacting small non-human instances are localized at the position of the human hand using OpenPose [4].

  • Depth values We first estimate pixel-wise depth values of the video frame, using DCNF-FCSP [37], and then take the average value for each instance.

  • Rare instance priority We notice that rare instances are always the nearest ones to the camera.
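The sketch below illustrates one way to realize this merging, assuming per-instance records with a heuristic type label (transportation, human, or held object) and an average depth value where larger means farther from the camera; the exact encoding of the heuristics is an assumption, not our implementation.

    import numpy as np

    def topological_order(instance_info, rare_ids):
        """Farthest-to-nearest ordering: interaction layer first, then average depth,
        with rare instances always placed nearest (painted last)."""
        layer = {'transport': 0, 'human': 1, 'held': 2}    # hypothetical type labels
        def key(iid):
            info = instance_info[iid]
            return (iid in rare_ids, layer.get(info['type'], 1), -info['depth'])
        return sorted(instance_info, key=key)

    def merge_instances(masks, order):
        """Paint masks back-to-front so nearer instances overwrite farther ones."""
        h, w = next(iter(masks.values())).shape
        label_map = np.zeros((h, w), np.uint8)             # 0 = background
        for iid in order:                                  # order: farthest ... nearest instance IDs
            label_map[masks[iid] > 0] = iid
        return label_map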

4 Experimental results

4.1 Dataset benchmark and metrics

We participated in the semi-supervised track of the DAVIS Challenges 2017–2019 (Footnotes 2–4) and evaluated our methods on the DAVIS Test-Challenge dataset. The full dataset consists of 150 sequences, totaling 10,459 annotated frames and 376 instances. There are 30 video sequences for testing, and their ground truth is not publicly available. Submissions were made through the CodaLab site of the challenge (Footnote 5). This dataset is challenging due to multiple object instances with more distractors, i.e., smaller instances and fine structures, more occlusions, and fast motion.

For the evaluation metrics, per-instance measures are used as described in [45]: region Jaccard (J) and boundary F-measure (F). The overall measure is computed as the mean of J and F, where both are averaged over all objects.
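For reference, the sketch below computes the region Jaccard directly as intersection-over-union and approximates the boundary F-measure by matching boundary pixels within a fixed tolerance band; the official DAVIS evaluation code performs a more careful boundary matching, so this is only an approximation.

    import cv2
    import numpy as np

    def region_jaccard(pred, gt):
        """Region similarity J: intersection-over-union of the binary masks."""
        pred, gt = pred > 0, gt > 0
        union = np.logical_or(pred, gt).sum()
        return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

    def boundary_f(pred, gt, tol=3):
        """Simplified boundary F: precision/recall of boundary pixels within a tolerance band."""
        def boundary(mask):
            m = (mask > 0).astype(np.uint8)
            return m - cv2.erode(m, np.ones((3, 3), np.uint8))
        pb, gb = boundary(pred), boundary(gt)
        kernel = np.ones((2 * tol + 1, 2 * tol + 1), np.uint8)
        prec = (pb & cv2.dilate(gb, kernel)).sum() / max(pb.sum(), 1)
        rec = (gb & cv2.dilate(pb, kernel)).sum() / max(gb.sum(), 1)
        return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)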

Table 1 Top global ranking results in the DAVIS Challenges 2017–2019. The best results are marked in italic
Fig. 10 Visualization results on the DAVIS Test-Challenge dataset. From top to bottom: the first video frame with the ground-truth label, followed by results of our proposed methods in preview segmentation [26], contextual segmentation [59], and guided segmentation [58]. The ground truth of these video frames is not publicly available. Our CGS results successfully track and segment the instances of interest annotated in the first frame

4.2 Results on DAVIS challenges 2017–2019

4.2.1 DAVIS 2017 challenge

Due to the time limit, we submitted only the proposed IRIF component to the DAVIS 2017 Challenge and achieved 3rd place out of 22 team submissions. As shown in Table 1, our proposed IRIF achieves very promising results in the DAVIS 2017 Challenge, namely 0.615, 0.662, and 0.638 in terms of region similarity (Jaccard index), contour accuracy (F-measure), and global score, respectively. These results indicate that our method is competitive with the state-of-the-art methods on this dataset. Our method maintains its performance as frames evolve, as seen from the best J decay and F decay among the leading submissions in 2017.

4.2.2 DAVIS 2018 challenge

We also submitted our CIS framework (i.e., the first two passes of CGS) to the DAVIS 2018 Challenge and achieved 6th place out of 41 team submissions. Table 1 shows that our CIS achieves promising results, namely 64.1%, 68.6%, and 66.3% in terms of region similarity (Jaccard index), contour accuracy (F-measure), and global score, respectively. Our method again maintains the most stable performance, in terms of J decay and F decay, among the leading submissions in 2018.

4.2.3 DAVIS 2019 challenge

As shown in Table 1, we obtained very competitive results. Our proposed CGS achieved 0.724, 0.784, and 0.754 in terms of region similarity (J), contour accuracy (F), and global score, respectively. Our method consistently achieved the best performance in decay and recall for all metrics. Furthermore, we note that our CGS ranks in the top 3 among the 4 teams that achieved a global score of 0.75 across all 3 years.

4.2.4 Ablation study

Table 2 shows the results of our proposed framework with different settings. Our proposed CGS (using all three passes) outperforms the variants using only two passes [59] or a single pass [26]. This highlights the significant contributions of the second and third passes, i.e., the multiple contextual segmentation schemes and the guided instance segmentation, respectively. In particular, contextual segmentation improves the performance by up to 2.5%, while guided segmentation further improves upon contextual segmentation by up to 9.1% in terms of global score.

Figure 10 visualizes segmentation results. From top to bottom, we show the first video frame and a triplet of processed video frames for our proposed methods: preview segmentation [26], contextual segmentation [59], and guided segmentation [58]. Our final CGS results surpass those of the other variants and successfully track and segment the key instances. Our framework can even handle camouflaged, small, and occluded instances.

Table 2 The performance of different components in our method on the DAVIS Test-Challenge dataset. PS, CS, and GS stand for preview segmentation, contextual segmentation, and guided segmentation, respectively

5 Conclusion

In this paper, we propose the novel CGS framework for semi-supervised instance segmentation in videos with three segmentation passes. In the first pass, we develop the novel IRIF for preview instance segmentation and contextual information extraction. In the second pass, we introduce multiple contextual segmentation schemes to deal with different instance types, such as human/non-human and rigid/non-rigid instances in known/unknown object categories. In the final pass, we propose a novel attention-based guided fine-grained segmentation that eliminates the complex background inside the region of interest for performance improvement.

Our proposed methods consistently achieve competitive results among the leading submissions in the DAVIS Challenges, i.e., 3rd place, 6th place, and 3rd place in 2017, 2018, and 2019, respectively. Our full CGS framework is in the top 3 among the 4 teams that achieved a global score of 0.75 across all 3 years. Our method also achieves the most stable performance and the best recall among the leading submissions.

In the future, we plan to model the semantic relationships among object instances in the segmentation process. We will also investigate Capsule-inspired [19, 51, 52, 79] and attention-inspired [5, 11, 12, 29] network architectures for better segmentation performance. We also aim to extend our work to camouflage analysis [24, 27, 75] in the near future.