Abstract
In this paper, we propose a contextual guided segmentation (CGS) framework for video instance segmentation in three passes. In the first pass, i.e., preview segmentation, we propose Instance Re-Identification Flow to estimate the main properties of each instance (i.e., human/non-human, rigid/deformable, known/unknown category) by propagating its preview mask to other frames. In the second pass, i.e., contextual segmentation, we introduce multiple contextual segmentation schemes. For human instances, we develop skeleton-guided segmentation in a frame along with object flow to correct and refine the result across frames. For non-human instances, if the instance has a wide variation in appearance and belongs to a known category (which can be inferred from the initial mask), we adopt instance segmentation. If the non-human instance is nearly rigid, we train FCNs on images synthesized from the first frame of a video sequence. In the final pass, i.e., guided segmentation, we develop a novel fine-grained segmentation method on non-rectangular regions of interest (ROIs). The natural-shaped ROI is generated by applying guided attention from the neighbor frames of the current one to reduce the ambiguity in the segmentation of different overlapping instances. Forward mask propagation is followed by backward mask propagation to further restore instance fragments that are missed because of re-appearing instances, fast motion, occlusion, or heavy deformation. Finally, instances in each frame are merged based on their depth values, together with human and non-human object interaction and rare instance priority. Experiments conducted on the DAVIS Test-Challenge dataset demonstrate the effectiveness of our proposed framework. Our submissions placed 3rd, 6th, and 3rd in the DAVIS Challenges 2017, 2018, and 2019, respectively, and the full framework achieved 75.4%, 72.4%, and 78.4% in terms of global score, region similarity, and contour accuracy, respectively.
1 Introduction
Object segmentation is considered a labeling problem that aims to separate foreground from background regions. Video instance segmentation, which is higher-level and more challenging than object segmentation, aims to assign each video frame pixel to an instance or to the background region and then assign consistent IDs to these instances over the video sequence. Object/instance segmentation in videos is beneficial in a wide range of practical applications, e.g., autonomous vehicles [1], action recognition [21], video summarization [30], object tracking [66], scene understanding [70], and video annotation [28].
This paper focuses on semi-supervised video instance segmentation [46], which targets certain instances whose ground-truth mask for the first video frame is given. The DAVIS Challenge [46] promotes the development of this task. The benchmark dataset of this challenge contains many difficulties, such as rapid motion, distractors, small objects, fine structures, occlusions, large deformations, and complex object interactions. Figure 1 shows some exemplary results of our proposed method on the DAVIS Test-Challenge dataset [46].
To address the challenges of the given problem, tracking and re-identification methods are adopted and jointly integrated into segmentation models to keep the consistency of targeted instances over the entire video sequence [17, 22, 31, 32]. However, existing works often fail to follow and segment targeted instances because they cannot cover the variety of contexts in a video. We argue that context information is essential for semantic segmentation to reduce ambiguous instances and obtain robust results. Therefore, this work aims to leverage context information to improve the performance of video instance segmentation. Inspired by the idea of “you should look twice” [42, 43] in the task of object detection, we propose a three-pass guided segmentation framework, namely contextual guided segmentation (CGS), to tackle the problem of semi-supervised video instance segmentation. Our proposed method consists of the two key ideas below.
First, we exploit variation in the video and propose various contextual segmentation strategies that adapt to contexts, i.e., the category and visual properties of an instance. To select the appropriate scheme, we propose a novel Instance Re-Identification Flow (IRIF) to propagate the initial mask of an instance to other frames and analyze the visual properties of segmented regions. Multiple contextual segmentation schemes are also introduced to adapt to the contextual properties of each instance. For human instances, we develop skeleton-guided segmentation. For non-human instances, we train FCNs from our synthesized dataset for nearly rigid instances with similar background scenes. Instance segmentation detectors are utilized to handle deformable non-human instances in known categories. Results from our IRIF are treated as the baseline scheme for other cases.
Second, to segment an instance in a region of interest (ROI), we propose novel guided fine-grained segmentation based on attention for performance improvement. We transform a regular rectangular ROI to a non-rectangular ROI by blending attention inferred from neighbor frames to eliminate complex background inside the ROI. We also propose bidirectional propagation strategies to construct adaptive attention for guided segmentation. The forward propagation strategy can correct segmentation errors caused by dense objects in an ROI. Meanwhile, the backward propagation strategy can recover missing instances due to fast motion, occlusion, or heavy deformation.
The DAVIS Challenges 2017–2019 results indicate that our method is competitive among the top-performing submissions. Our preliminary results were reported at the DAVIS 2017 Challenge [25], DAVIS 2018 Challenge [59], and DAVIS 2019 Challenge [58]. In this paper, we provide the full details of our proposed framework. Our contributions are as follows.
- We propose the contextual guided segmentation (CGS) framework with three segmentation passes to exploit various contexts in video instance segmentation. Our submissions placed 3rd, 6th, and 3rd in the DAVIS Challenges 2017, 2018, and 2019, respectively.
- We propose Instance Re-Identification Flow (IRIF) to extract contextual properties of each instance by propagating its preview mask from the current frame to subsequent frames.
- We introduce multiple contextual segmentation schemes that adapt to the contextual properties of each instance.
- We propose bidirectional propagation strategies for guided fine-grained segmentation in non-rectangular ROIs. Our proposed guided segmentation outperforms standard segmentation, which is mostly applied in rectangular ROIs.
- To blend instance masks into a unique result, we introduce a merging process based on their depth values together with human and non-human object interaction and rare instance priority.
- We construct Wonderland Data to increase the amount of training data for one-shot learning. Our proposed augmentation approach can also be utilized for other problems.
The remainder of this paper is organized as follows. In Sect. 2, we briefly review the related work. Next, our proposed methods are presented in Sect. 3. Experimental results are then reported and discussed in Sect. 4. Finally, Sect. 5 concludes and paves the way for future work.
2 Related work
2.1 One-shot learning
Data augmentation is essential to deal with one-shot learning [2], which aims to train a deep network with only a given first video frame. Caelles et al. [2] introduced the first simple data augmentation strategy, consisting of random crop, random scale, vertical flip, and random changes in brightness, saturation, and contrast of the given first frame. Khoreva et al. [22] later introduced Lucid Dreaming [22] to synthesize foreground changes by rigid and non-rigid transformations of small extent, and to synthesize background changes using affine deformations with limited appearance variations. The given first frame with ground truth is augmented with Lucid Dreaming to generate more training data with different viewpoints, leading to much improved network training. Hence, data augmented by Lucid Dreaming, called Lucid Data, has become common for one-shot learning. However, Lucid Data cannot deal with different backgrounds caused by objects’ motion or camera view changes. Guo et al. [17] replaced the background of the first video frame with pure background images crawled randomly from the Internet via Google, namely Online Data. However, Online Data is unstable because the images are crawled randomly from the Internet without considering the content of the video. In contrast, our Wonderland Data is filtered from large-scale scene data to choose the scenes most similar to the video.
Khoreva et al. [22] trained appearance-based and motion-based models with Lucid Data [22]. Shaban et al. [54] learned video segments by bootstrapping them from temporally consistent object proposals, which are first spatially trained on Lucid Data [22] and then combined with a semi-Markov pixel-level motion model to form spatiotemporal object proposals. Luiten et al. [38] first trained DeepLab3+ [8] on a combination of standard datasets and then fine-tuned the network on Lucid Data [22] of each video to form a strong network to segment the instance inside an ROI. Li et al. [31] trained an online re-identification network, which is the original Region Proposal Network of Mask R-CNN, and a recurrent mask propagation network on Lucid Data [22]. Xu [71] proposed a spatiotemporal CNN in which the spatial segmentation branch is fine-tuned online on Lucid Data of each sequence while the temporal coherence branch is trained offline on the entire dataset. Models are not only fine-tuned offline on Lucid Data [22] of the first frame but can also be updated online while processing the video [62]. Mask R-CNN is fine-tuned on Lucid Data [74] or Online Data [17] to adapt proposals to the video.
2.2 Temporal connection mining
This approach aims to perform instance tracking, propagation, and re-identification, where each instance is detected and re-identified through frames [32]. Li et al. [32] iteratively propagated masks via flow warping and re-identified instances via adaptive matching to retrieve missing ones. Luiten et al. [38] first segmented multiple object proposals in the entire video and then selected and linked these proposals over time using a re-identification feature embedding vector for each proposal. Re-identification feature embedding vectors are computed using a triplet-loss-based re-identification embedding network. Li et al. [31] combined re-identification and attention-based recurrent temporal propagation into a unified framework to retrieve missing objects despite their large appearance changes. Guo et al. [17] first extracted possible mask proposals in each frame and then joined tracking and re-identification to filter and rank proposals and merge the most confident ones. Xu et al. [74] adapted a multiple hypotheses tracking method to build up a bounding box proposal tracking tree for different objects, then propagated masks, and finally merged mask proposals from the tracking tree. Wang et al. [66] used fully convolutional Siamese trackers to produce class-agnostic binary segmentation masks of the target objects. Voigtlaender et al. [61] used a semantic pixel-wise embedding together with a global and a local matching mechanism to transfer information from the first frame and from the previous frame of the video to the current frame, which is used as internal guidance for segmentation. Jonathon et al. [39] used a Siamese architecture to detect and track multiple objects and then performed segmentation inside the detected bounding boxes. Tran et al. [57] propagated masks with reference to multiple extra samples through a memory reference pool.
2.3 End-to-end temporal learning
This approach directly learns temporal information in a video through deep learning architectures such as LSTM, guided attention, or memory networks. Some methods combine feature maps from different video frames by correlation matching [61] or non-local matching [44]. Guo et al. [16] integrated STM [44] into DeepLabv3+ [8] to concatenate low-level features in the mask decoder. Andreas et al. [48] implemented a memory network to add semantic information about the target object from a previous frame to the refinement stage, complementing the predictions provided by the target appearance model. Zhang et al. [78] developed a spatial constraint module that takes the previous prediction to generate a spatial prior for the current frame, helping to disambiguate appearance confusion and eliminate false predictions. Fiaz et al. [15] introduced a guided feature learning algorithm without model update for directional deep appearance learning. Liu et al. [35] integrated a multilevel backbone into a memory network to generate higher spatial resolution features. Le et al. [64] leveraged existing memory-based models and enhanced their capability by adding pre-processing and post-processing steps. Xie et al. [69] integrated depth maps from a video sequence into STM [44] to alleviate the ambiguity of objects with similar appearances. Seong et al. [53] developed a kernelized memory network and used the Hide-and-Seek training strategy to handle occlusions and segment boundary extraction. Yang et al. [77] combined collaborative foreground–background integration with multi-scale matching to be robust to various object scales.
3 Proposed method
3.1 Overview
Figure 2 illustrates CGS with three passes: preview segmentation for context evaluation, contextual segmentation, and guided segmentation based on propagation. In particular, in the first pass, we propose Instance Re-Identification Flow (IRIF) to generate the preview mask sequence and extract different contextual properties from each instance. In the second pass, we introduce multiple segmentation schemes corresponding to the extracted properties. In the third pass, we develop fine-grained segmentation based on guided propagation. We remark that each instance is processed independently over the frames of a video sequence. Finally, instance masks are blended with reference to depth information, human and non-human instance interaction, and rare instance priority.
3.2 Preview Segmentation
Figure 3 illustrates the flowchart of Instance Re-Identification Flow (IRIF) for preview segmentation. The segmentation performed on the current frame is based on the history information of the previous frames. The segmentation result of the current frame is further fed to the process of the coming frame.
We remark that in this component, we consider two types of instance, i.e., human and non-human, to treat each instance in different ways. Given the first frame with its ground-truth label, we extract the bounding box for each instance and then perform human/non-human classification for all instances using Mask R-CNN [18].
3.2.1 Instance localization and tracking
For each video frame, we localize and track instances in a re-identification manner. Note that we expand the bounding box by 10% to capture the whole area of each object instance. For human objects, we employ person search [68]: we detect persons using Faster R-CNN and then extract a person re-identification feature for every detected person region. On the other hand, DeepFlow [67] and Deformable Part Models (DPM) [14] are utilized to detect and track non-human objects.
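The bounding box expansion can be illustrated with a minimal sketch. The function name `expand_box`, the `(x0, y0, x1, y1)` box layout, and the choice to split the extra 10% evenly between the two sides of each axis are assumptions made for illustration; the text only states that the box is expanded by 10%.

```python
def expand_box(box, image_size, ratio=0.10):
    """Grow an (x0, y0, x1, y1) box by `ratio` of its size, clipped to the image.

    Assumption: the extra margin is split evenly between both sides of each axis.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    dx = (x1 - x0) * ratio / 2.0  # horizontal margin added on each side
    dy = (y1 - y0) * ratio / 2.0  # vertical margin added on each side
    return (
        max(0.0, x0 - dx),
        max(0.0, y0 - dy),
        min(float(width), x1 + dx),
        min(float(height), y1 + dy),
    )
```

Clipping to the image size keeps the expanded box valid for instances that touch the frame border.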
3.2.2 Adaptive online learning for instance segmentation
For each instance, to identify each pixel as foreground (instance) or background, we utilize multiple binary SVM classifiers [6], which are learned from the appearance of the previous n frames with sampling step size \(\delta \), where n and \(\delta \) are set to 8 and 2, respectively. Note that our multiple binary SVM classifiers are implemented for history reference with several unary features, e.g., saliency [36], CNN features [23], location of the bounding box, and color, to segment each instance within its tracked bounding box in each frame. We only update the SVM model if the size of an instance changes significantly. We then utilize GrabCut [49] for each instance to separate it from the background. After this step, each pixel is assigned an instance ID.
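The history sampling and the update rule can be sketched as follows. This is one possible reading, not the exact procedure: `history_frames` assumes that n reference frames are taken at a stride of \(\delta \) going backward from the current frame, and the 30% area-change `tolerance` in `should_update` is a hypothetical value, since the text does not quantify what a significant size change is.

```python
def history_frames(t, n=8, delta=2):
    """Indices of up to n earlier frames, spaced delta apart, used as history for frame t."""
    return [t - k * delta for k in range(1, n + 1) if t - k * delta >= 0]


def should_update(prev_area, cur_area, tolerance=0.3):
    """True when the instance area changed by more than `tolerance` (hypothetical value)."""
    if prev_area == 0:
        return cur_area > 0
    return abs(cur_area - prev_area) / prev_area > tolerance
```

With the default settings, frame 20 would draw its training samples from frames 18, 16, and so on down to 4, and the classifier would be retrained only when the mask area drifts by more than the chosen tolerance.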
Specifically, for a human instance that goes missing and re-appears within the next couple of frames, we adopt the state-of-the-art image parser, Pyramid Scene Parsing Network (PSPNet) [81], with a model pre-trained on the PASCAL VOC dataset [13]. The re-identification results from PSPNet are blended into our segmentation outcomes.
3.2.3 Contextual property extraction
This component aims to determine the context of an instance so that we can apply an appropriate segmentation scheme for that instance. The context can be any observable properties that may affect the strategy to extract the mask of an instance in frames efficiently. In this work, we consider the following three attributes of an instance as its context: human or non-human, known or unknown category, rigid or deformable.
The category of an instance, such as person, car, and dog, can be directly inferred from its initial mask using pre-trained Mask R-CNN [18] on the MS-COCO dataset.
To evaluate if an instance is rigid or deformable, we analyze the preview sequence of instance masks in the first \(n_{Preview}\) frames. If there exists a homography matrix to transform the instance from the first frame to another frame for most frames in the first \(n_{Preview}\) frames, we consider the instance to be rigid.
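A minimal sketch of this rigidity test is given below. It assumes that a homography `H` has already been fitted between the first frame and each preview frame (for example with a RANSAC-based estimator) and that matched point pairs are available. The 3-pixel reprojection tolerance, the 0.8 inlier threshold, and the reading of "most frames" as more than half are illustrative assumptions.

```python
import math


def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (
        (H[0][0] * x + H[0][1] * y + H[0][2]) / w,
        (H[1][0] * x + H[1][1] * y + H[1][2]) / w,
    )


def inlier_ratio(H, src_points, dst_points, max_error=3.0):
    """Fraction of correspondences whose reprojection error is within max_error pixels."""
    hits = 0
    for src, dst in zip(src_points, dst_points):
        px, py = apply_homography(H, src)
        if math.hypot(px - dst[0], py - dst[1]) <= max_error:
            hits += 1
    return hits / len(src_points)


def is_rigid(inlier_ratios, inlier_thresh=0.8, frame_fraction=0.5):
    """Rigid if a homography explains the instance in most preview frames."""
    if not inlier_ratios:
        return False
    fitted = sum(1 for r in inlier_ratios if r >= inlier_thresh)
    return fitted / len(inlier_ratios) > frame_fraction
```

In use, `inlier_ratio` would be evaluated once per frame among the first \(n_{Preview}\) frames and the resulting list passed to `is_rigid`.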
3.3 Contextual segmentation
Each instance is segmented in different appropriate ways in this contextual segmentation, adapting to its extracted contextual properties (i.e., human/non-human, rigid/deformable, known/unknown category).
3.3.1 Human instance segmentation
We employ Mask R-CNN [18], pre-trained on the MS-COCO dataset [34], to extract human segments. However, the results of Mask R-CNN may be affected by occlusion or unusual human pose.
To overcome this issue, we develop skeleton-guided segmentation. We use the skeletons from OpenPose [4] as a reference to control and refine human instance segmentation. For a human instance with an unusual pose that Mask R-CNN cannot recognize, we dilate the skeleton to obtain a skeleton-guided region, i.e., an image with only the region containing the complete human instance. We then apply Mask R-CNN on the skeleton-guided region. With unrelated content eliminated, Mask R-CNN has a higher chance of extracting the human instance segment correctly (see Fig. 4). To preserve the inter-frame mask consistency, we use object flow [60] to correct and refine the result across frames.
3.3.2 Rigid non-human instance segmentation
For this type of instance, our objective is to accurately extract such instances from different backgrounds in the same scene category as the initial frame. Our method to process each instance is as follows. First, we synthesize images from the first frame of a video sequence, resulting in Wonderland Data. Second, to segment instances inside bounding boxes, we train DeepLab2 [7] and OSVOS [2] on our synthesized Wonderland Data.
Wonderland Data Generation. Differently from existing work, we exploit various contextual properties from instances. After that, multiple segmentation schemes are performed for each instance, adapting to its extracted contextual properties. Inspired by Lucid Data [22], we introduce new augmented data, namely Wonderland Data. To generate visual variations of the initial mask, we apply both affine and non-rigid deformations, together with illumination changes, on the mask. We also replace the background with the most similar scenes filtered from the large-scale Places365 dataset [82] to preserve the semantics of the image. In this way, we can generate more training samples than Lucid Data (10,000 images for each video, compared with 2,500 images of Lucid Data) to deal with one-shot learning.
Figure 5 illustrates our proposed Wonderland Data generation. In this work, from a pair of an input image and a mask, we generate 10,000 different pairs of synthesized images and masks. The Wonderland Data is published on our website. We collect scene photos from the training set of the Places365 dataset [82], which has about 8 million images divided into 365 scene categories. We manually discard artificial scenes and use only 22 natural scene categories with 592k images. For each image, we extract a feature at the last layer of DenseNet-161 [20], which was pre-trained on the Places365 dataset [82]. This feature is used to build a hierarchical k-means search tree for each category independently. We assume that each node has M images, and a leaf node has at most L images. To cluster images at a node, we propose to use the k-means algorithm with \(K=\min (M \backslash L, T)\). In this work, we empirically set \(L=200\) and \(T=200\) to speed up clustering.
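The cluster count per node can be sketched as below. The function name `branching_factor` is hypothetical, the quotient of M and L is read as integer division, and the guard that forces at least two clusters for nodes only slightly larger than a leaf is an added assumption, since the text does not discuss that case.

```python
def branching_factor(num_images, leaf_size=200, max_clusters=200):
    """Number of k-means clusters at a node holding `num_images` images.

    Returns 0 when the node already fits in a leaf and needs no further split.
    Assumption: the quotient of M and L is taken as integer division.
    """
    if num_images <= leaf_size:
        return 0
    # Guard (assumption): a node that must be split gets at least two clusters.
    return max(2, min(num_images // leaf_size, max_clusters))
```

For example, a node with 30,000 images would be split into 150 clusters, whereas a node with 100,000 images would be capped at T = 200 clusters and refined further at the next level.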
We classify an input image into the corresponding category, using DenseNet-161 pre-trained on the Places365 challenge dataset. We also extract a channel feature at the last layer of the same network. After that, we search leaf nodes by comparing the Euclidean distance between the feature of an input image and the centers of the clusters. To search N images, we randomly choose \(80\%\) of the images of the nearest leaf node and \(70\%, 60\%, 50\%\), etc. of the images of the next leaf nodes, respectively.
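The decreasing sampling quota can be sketched as follows, assuming the leaf nodes are already sorted by the distance between their cluster centers and the query feature. The function name `sample_scenes`, the fixed random seed, and stopping once N images are collected are illustrative assumptions.

```python
import random


def sample_scenes(leaves_by_distance, n_images, start_percent=80, step_percent=10, seed=0):
    """Pick up to n_images scenes, taking 80%, 70%, 60%, ... of successive leaves.

    `leaves_by_distance` is a list of leaves (lists of image ids), nearest leaf first.
    """
    rng = random.Random(seed)
    picked = []
    percent = start_percent
    for leaf in leaves_by_distance:
        if len(picked) >= n_images or percent <= 0:
            break
        # Quota for this leaf, capped by the number of images still needed.
        quota = min(len(leaf) * percent // 100, n_images - len(picked))
        picked.extend(rng.sample(leaf, quota))
        percent -= step_percent
    return picked
```

Drawing fewer images from more distant leaves keeps most of the backgrounds close to the input scene while still adding some variety.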
We also extract the object mask from the input image, then transform the object and the retrieved scenes independently, similarly to [22]. In more detail, we use affine transformations (e.g., translation, rotation, and scale) and non-rigid deformations, together with illumination changes. Figure 6 shows examples of Lucid Data and our Wonderland Data.
Network Training. Figure 7 shows our training process, including domain-based training and object-based training. In domain-based training, we fine-tune pre-trained networks (i.e., DeepLab2 [7] pre-trained on the COCO-Stuff dataset [3] and OSVOS [2] pre-trained on the ImageNet dataset [50]) on the DAVIS training data for domain transformation. In object-based training, we fine-tune the networks on the ground-truth mask of each instance of each video. We remark that we use only the first frame of each video and apply the proposed Wonderland Data generation method to these images.
3.3.3 Deformable non-human instance segmentation
For this instance type, we categorize instances into two groups, namely, known and unknown categories. For the known categories, i.e., already listed in MS-COCO dataset [34], we simply adopt Mask R-CNN to retrieve the instance segments. We directly obtain the preview results from our IRIF component for the unknown categories since it can handle arbitrary object categories.
3.4 Guided segmentation
Traditional Fully Convolutional Networks (FCNs) consider the entire rectangular region of interest (ROI) as the input to segment objects inside the ROI. This can lead to incorrect boundary segmentation due to the complex background and concave hull of the object. To overcome this limitation, we aim to transform a rectangular ROI to a non-rectangular ROI across the object boundary to eliminate the complex background inside the ROI (see Fig. 8). In particular, we utilize referral information from extra frames to identify the shape of the instance of interest inside the ROI of the current frame. We propose to apply guided attention to construct the non-rectangular ROI and then perform fine-grained segmentation on this guided non-rectangular ROI.
3.4.1 Bidirectional propagation
We propose bidirectional strategies to construct adaptive attention for guided segmentation. In particular, initial segments from neighbor frames are used as references for segmentation at the current frame. Attention is computed by two strategies applied sequentially, i.e., forward propagation and backward propagation, each adapted to the context. The forward propagation strategy, where attention is referenced from initial segments of previous frames, can correct excessive segmentation due to dense objects in an ROI (cf. Fig. 9a). Meanwhile, the backward propagation strategy, where attention is referenced from initial segments of subsequent frames, can recover missing instances due to fast motion, occlusion, or heavy deformation (size changing from tiny to large or vice versa) (cf. Fig. 9b).
3.4.2 Guided non-rectangular ROI construction
To construct a guided non-rectangular ROI, we expand the mask of the instance of interest at neighbor frames and then transfer and combine them at the current frame. This guarantees that the ROI covers the entire instance of interest. We do not apply mask propagation, in order to avoid inaccurate flow warping and to reduce the computational complexity. Then, we create a smooth transition region (by applying a blurred mask to remove background) for the guided ROI to avoid a sharp border between the ROI and the background. This is essential to make the segmentation method focus on the instance of interest and to avoid inaccurate segmentation caused by a sharp border. We remark that the range of boundary expansion and transition smoothing is computed based on the intensity of movement of the instance. Both propagation strategies are performed adaptively if the initial segments of the instance of interest at the current frame differ substantially (in appearance or size) from those at neighbor frames or if the instance re-appears. Otherwise, we only refine the instance of interest at the current frame to save computational cost.
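The ROI construction can be sketched on small binary masks stored as nested lists. The square dilation window, the box blur used for the transition region, and the fixed `expand` and `smooth` radii are illustrative assumptions; in the framework, both ranges are derived from the motion intensity of the instance.

```python
def dilate(mask, radius):
    """Binary dilation with a square window of the given radius."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def box_blur(values, radius):
    """Mean filter; windows are clipped at the image border."""
    h, w = len(values), len(values[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                values[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def guided_roi(neighbor_masks, expand=1, smooth=1):
    """Union of expanded neighbor-frame masks, softened into weights in [0, 1]."""
    h, w = len(neighbor_masks[0]), len(neighbor_masks[0][0])
    union = [[0] * w for _ in range(h)]
    for mask in neighbor_masks:
        grown = dilate(mask, expand)
        for y in range(h):
            for x in range(w):
                union[y][x] = max(union[y][x], grown[y][x])
    return box_blur(union, smooth)
```

Multiplying the current frame by the returned weights suppresses the background outside the guided region while leaving a gradual falloff instead of a sharp border.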
3.4.3 Fine-grained segmentation
We use Deep Grabcut [72] and Mask R-CNN [18] for fine-grained segmentation in guided non-rectangular ROIs. Inspired by Luiten et al. [38], we train DeepLab3+ [8] with an Xception-65 [10] backbone on the MS-COCO [34] and Mapillary [40] datasets to enhance the network generalization. For Mask R-CNN, we directly use a model pre-trained on the MS-COCO [34] dataset.
3.5 Refinement and merging
Through preliminary results, we observe that the initial segmentation is not smooth enough. Therefore, we refine instance masks to improve segmentation quality, using rare instance attention and boundary snapping.
3.5.1 Rare instance attention refinement
We further refine the results by considering rare instances. We observe that rare objects tend to be shrunk by larger objects. To identify rare object instances, we compute the area percentage of each object instance mask (provided in the first frame). Instances with a size smaller than 5% of the total size of the tracked objects are considered rare. We assume that an object that is small in the first frame tends to remain small throughout the video. Next, we recover rare object instances by transferring the results produced by the foreground probability obtained from the binary SVM classifier on each object instance.
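The rare-instance test reduces to a simple area comparison, sketched below. The function name `rare_instances` and the dictionary layout (instance ID mapped to first-frame mask area in pixels) are assumptions for illustration.

```python
def rare_instances(areas, threshold=0.05):
    """IDs of instances whose first-frame area is below `threshold` of the total tracked area."""
    total = sum(areas.values())
    if total == 0:
        return set()
    return {instance_id for instance_id, area in areas.items() if area < threshold * total}
```

For instance, with areas of 9,000, 700, and 300 pixels, only the 300-pixel instance falls below the 5% cut-off of 500 pixels.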
3.5.2 Boundary snapping refinement
We also adopt boundary snapping [2] to further refine object shapes. In particular, we extract the saliency [36] and the contour [76] from the video frame. The salient pixels close to the contour are snapped.
3.5.3 Topological order estimation for instance merging
It is essential to determine the topological relationship (in terms of z-order) between multiple instances in order to sequentially combine the corresponding masks of different instances into the final result. We merge instances based on the following heuristics, applied in this order: human and non-human instance interaction, depth values, and rare instance priority.
- Human and non-human instance interaction: We define interaction heuristics as follows: transportation instances (such as horse, bike, motor, surfboard, and skateboard) are the farthest from the camera; human instances are at a middle distance from the camera; and small non-human instances that can be held, carried, touched, etc. are the nearest to the camera. Interacted small non-human instances are localized at the position of the human hand using OpenPose [4].
- Depth values: We first estimate pixel-wise depth values of the video frame, using DCNF-FCSP [37], and then take the average value for each instance.
- Rare instance priority: We notice that rare instances are always the nearest ones to the camera.
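The merging step can be sketched as a painter's algorithm that draws instances from farthest to nearest. The way the three heuristics are combined here is only one possible reading: rare instances are drawn last, then the interaction tier decides (0 for transportation, 1 for human, 2 for held objects), and the average depth breaks ties, with larger values meaning farther away. The dictionary keys are hypothetical.

```python
def merge_instances(instances, height, width):
    """Blend instance masks into one label map, nearer instances overwriting farther ones.

    Each instance is a dict with keys (all hypothetical names):
      "id": int label, "pixels": iterable of (row, col),
      "tier": 0 transportation, 1 human, 2 held object,
      "depth": average depth (larger means farther), "rare": bool.
    """
    label_map = [[0] * width for _ in range(height)]
    # Painting order, farthest first: non-rare before rare, low tier first, deep first.
    order = sorted(instances, key=lambda inst: (inst["rare"], inst["tier"], -inst["depth"]))
    for inst in order:
        for row, col in inst["pixels"]:
            label_map[row][col] = inst["id"]
    return label_map
```

Because later instances overwrite earlier ones, an overlapping pixel ends up with the label of the instance judged nearest to the camera.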
4 Experimental results
4.1 Dataset benchmark and metrics
We participated in the DAVIS Challenges 2017–2019, Semi-Supervised Track, and evaluated our methods on the DAVIS Test-Challenge dataset. The dataset consists of 150 sequences, totaling 10,459 annotated frames and 376 instances. There are a total of 30 video sequences for testing, and their ground truth is not publicly available. Submissions were made through the CodaLab site of the challenge. This dataset is challenging due to multiple object instances with more distractors, i.e., smaller instances and fine structures, more occlusions, and fast motion.
For the evaluation metrics, per-instance measures are used as described in [45]: Region Jaccard (J) and Boundary F measure (F). The overall measures are computed as the mean between J and F, and both are averaged over all objects.
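These measures can be sketched as follows, with masks represented as sets of pixel coordinates. The boundary precision and recall are taken as given inputs, since the boundary matching itself is not reproduced here, and treating two empty masks as a perfect match is an assumed convention.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two masks given as pixel sets."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    if not union:
        return 1.0  # both masks empty: treated as a perfect match (assumed convention)
    return len(pred & gt) / len(union)


def f_measure(precision, recall):
    """Boundary F measure: harmonic mean of boundary precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def global_score(j_mean, f_mean):
    """Overall measure: mean of the averaged J and F values."""
    return (j_mean + f_mean) / 2.0
```

As a consistency check, `global_score(0.724, 0.784)` gives 0.754 up to floating-point rounding, which agrees with the 2019 results reported in this paper.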
4.2 Results on DAVIS challenges 2017–2019
4.2.1 DAVIS 2017 challenge
Due to the time limit, we submitted only the proposed IRIF component to the DAVIS 2017 Challenge and achieved 3rd place out of 22 team submissions. As shown in Table 1, our proposed IRIF achieves very promising results in the DAVIS 2017 Challenge, namely, 0.615, 0.662, and 0.638 in terms of region similarity (Jaccard index), contour accuracy (F measure), and global score, respectively. These results indicate that our method is competitive among the state-of-the-art methods on this dataset. Our method maintains its performance as frames evolve, as shown by the best performance in terms of J decay and F decay among the leading submissions in 2017.
4.2.2 DAVIS 2018 challenge
We also submitted our CIS framework to the DAVIS 2018 Challenge and achieved 6th place out of 41 team submissions. Table 1 shows that our CIS achieves promising results, namely, 64.1%, 68.6%, and 66.3% in terms of region similarity (Jaccard index), contour accuracy (F measure), and global score, respectively. Our method also maintains the most stable performance in terms of J decay and F decay among the leading submissions in 2018.
4.2.3 DAVIS 2019 challenge
As shown in Table 1, we obtained very competitive results. Our proposed CGS achieved 0.724, 0.784, and 0.754 in terms of region similarity (J), contour accuracy (F), and global score, respectively. Our method consistently achieved the best performance in Decay and Recall for all metrics. Furthermore, we note that our CGS ranks in the top 3 among the 4 teams that reached a global score of 0.75 across all 3 years of the challenge.
4.2.4 Ablation study
Table 2 shows the results of our proposed framework with different settings. Our proposed CGS (using all three passes) outperforms using only two passes [59] or a single pass [26]. This highlights the significant contribution of the second pass and the third pass, which are the multiple contextual segmentation schemes and guided instance segmentation, respectively. In particular, contextual segmentation improves the performance by up to 2.5%. Meanwhile, guided segmentation improves on contextual segmentation by up to 9.1% in the global score.
Figure 10 visualizes segmentation results. From top row to bottom row, we can observe the first video frame and a triple of processed video frames of our proposed methods in preview segmentation [26], contextual segmentation [59], and guided segmentation [58]. Our final CGS results surpass the performance of others and successfully track and segment the key instances. Our framework can even handle camouflaged instances, small instances, and occluded instances.
5 Conclusion
In this paper, we propose the novel CGS framework for semi-supervised instance segmentation in videos with three segmentation passes. In the first pass, we develop the novel IRIF for preview instance segmentation and extract contextual information. In the second pass, we introduce multiple contextual segmentation schemes to deal with different instance types, such as human/non-human and rigid/non-rigid instances in known/unknown object categories. In the final pass, we propose a novel guided fine-grained segmentation based on attention to eliminate complex background inside the region of interest for performance improvement.
Our proposed methods achieve competitive results among the leading submissions in the DAVIS Challenges consistently, i.e.,3rd place, 6th place, and 3rd place in 2017, 2018, and 2019, respectively. Our full framework CGS is in the top 3 over 4 teams achieving 0.75 in terms of global score in all 3 years. Our method also maintains the best stable and recall performance among the leading submissions.
In the future, we plan to consider modeling the semantic relationship among object instances in the segmentation process. We will also investigate Capsule-inspired [19, 51, 52, 79], and attention-inspired [5, 11, 12, 29] network architectures for better segmentation performance. We also aim to extend our work to camouflage analysis [24, 27, 75] in the near future.
References
Brabandere, B.D., Neven, D., Gool, L.V.: Semantic instance segmentation for autonomous driving. In: CVPR Workshops (2017)
Caelles, S., Maninis, K.-K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: CVPR (2017)
Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: Thing and stuff classes in context. In: CVPR (2018)
Cao, Z., Simon, T., Wei, S.-E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: CVPR (2017)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV, pp. 213–229 (2020)
Chang, C., Lin, C.: LIBSVM: a library for support vector machines. Trans. Intell. Syst. Technol. 2(3), 1009 (2011)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. Trans. Pattern Anal. Mach. Intell. 40(4), 179 (2018)
Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV (2018)
Cheng, J., Liu, S., Tsai, Y.-H., Hung, W.-C., Gupta, S., Gu, J., Kautz, J., Wang, S., Yang, M.-H.: Learning to segment instances in videos with spatial propagation network. In: CVPR Workshops (2017)
Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: CVPR (2017)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale. ICLR 3, 10076 (2021)
Duke, B., Ahmed, A., Wolf, C., Aarabi, P., Taylor, G.W.: Sstvos: sparse spatiotemporal transformers for video object segmentation. In: CVPR (2021)
Everingham, M., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The pascal visual object classes (VOC) challenge. IJCV 88(2), 10007 (2010)
Felzenszwalb, P.F., McAllester, D.A., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: CVPR (2008)
Fiaz, M., Mahmood, A., Jung, S.K.: Video object segmentation using guided feature and directional deep appearance learning. In: CVPR Workshops (2020)
Guo, H., Wang, W., Guo, G., Li, H., Liu, J., He, Q., Xiao, X.: An empirical study of propagation-based methods for video object segmentation. In: CVPR Workshops (2019)
Guo, P., Zhang, L., Zhang, H., Liu, X., Ren, H., Zhang, Y.: Adaptive video object segmentation with online data generation. In: CVPR Workshops (2018)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: ICCV (2017)
Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: International Conference on Artificial Neural Networks, pp. 44–51 (2011)
Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
Ji, J., Buch, S., Soto, A., Niebles, J.C.: End-to-end joint semantic segmentation of actors and actions in video. In: ECCV (2018)
Khoreva, A., Benenson, R., Ilg, E., Brox, T., Schiele, B.: Lucid data dreaming for object tracking. In: CVPR Workshops (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)
Le, T.-N., Cao, Y., Nguyen, T.-C., Le, M.-Q., Nguyen, K.-D., Do, T.-T., Tran, M.-T., Nguyen, T.V.: Camouflaged instance segmentation in-the-wild: Dataset and benchmark suite. arXiv pre-print: arXiv:2103.17123 (2021)
Le, T.-N., Nguyen, K.-T., Nguyen-Phan, M.-H., Ton, T.-V., Nguyen, T.-A., Trinh, X.-S., Dinh, Q.-H., Nguyen, V.-T., Duong, A.-D., Sugimoto, A., Nguyen, T.V., Tran, M.-T.: Instance re-identification flow for video object segmentation. In: CVPR Workshops (2017)
Le, T.-N., Nguyen, K.-T., Nguyen-Phan, M.-H., Ton-That, V., Nguyen, T.-A., Trinh, X.-S., Dinh, Q.-H., Nguyen, V.-T., Duong, A.D., Sugimoto, A., Nguyen, T.V., Tran, M.-T.: Instance re-identification flow for video object segmentation. In: CVPR Workshops (2017)
Le, T.-N., Nguyen, T.V., Nie, Z., Tran, M.-T., Sugimoto, A.: Anabranch network for camouflaged object segmentation. J. Comput. Vis. Image Underst. 184, 45–56 (2019)
Le, T.-N., Nguyen, T.V., Tran, Q.-C., Nguyen, L., Hoang, T.-H., Le, M.-Q., Tran, M.-T.: Interactive video object mask annotation. In: AAAI (2021)
Le, T.-N., Sugimoto, A., Ono, S., Kawasaki, H.: Attention r-cnn for accident detection. In: IEEE Intelligent Vehicles Symposium (2020)
Lee, Y.J., Grauman, K.: Predicting important objects for egocentric video summarization. IJCV 114(1), 1073 (2015)
Li, X., Loy, C.C.: Video object segmentation with joint re-identification and attention-aware mask propagation. In: CVPR Workshops (2018)
Li, X., Qi, Y., Wang, Z., Chen, K., Liu, Z., Shi, J., Luo, P., Loy, C.C., Tang, X.: Video object segmentation with re-identification. In: CVPR Workshops (2017)
Lin, A., Chou, Y., Martinez, T.: Flow adaptive video object segmentation. In: CVPR Workshops (2018)
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
Liu, D., Yu, D., Dong, M., Ma, L., Shao, J., Wang, J., Wang, C., Zhou, P.: An effective multi-level backbone for video object segmentation. In: CVPR Workshops (2020)
Liu, N., Han, J.: Dhsnet: Deep hierarchical saliency network for salient object detection. In: CVPR (2016)
Liu, N., Han, J., Zhang, D., Wen, S., Liu, T.: Predicting eye fixations using convolutional neural networks. In: CVPR (2015)
Luiten, J., Voigtlaender, P., Leibe, B.: Premvos: Proposal-generation, refinement and merging for the davis challenge on video object segmentation. In: CVPR Workshops (2018)
Luiten, J., Voigtlaender, P., Leibe, B.: Combining premvos with box-level tracking for the 2019 davis challenge. In: CVPR Workshops (2019)
Neuhold, G., Ollmann, T., Rota Bulo, S., Kontschieder, P.: The mapillary vistas dataset for semantic understanding of street scenes. In: ICCV (2017)
Newswanger, A., Xu, C.: One-shot video object segmentation with iterative online fine-tuning. In: CVPR Workshops (2017)
Nguyen, K., Nguyen, K., Le, D., Duong, D.A., Nguyen, T.V.: YADA: you always dream again for better object detection. Multim. Tools Appl. 78(19), 28189–28208 (2019)
Nguyen, K., Nguyen, K., Le, D., Duong, D.A., Nguyen, T.V.: You always look again: Learning to detect the unseen objects. J. Vis. Commun. Image Represent. 60, 206–216 (2019)
Oh, S.W., Lee, J., Xu, N., Kim, S.J.: A unified model for semi-supervised and interactive video object segmentation using space-time memory networks. In: CVPR Workshops (2019)
Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool, L.V., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)
Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., Van Gool, L.: The 2017 davis challenge on video object segmentation. arXiv:1704.00675 (2017)
Robinson, A., Lawin, F.J., Danelljan, M., Felsberg, M.: Discriminative learning and target attention for the 2019 davis challenge on video object segmentation. In: CVPR Workshops (2019)
Robinson, A., Lawin, F.J., Danelljan, M., Khan, F.S., Felsberg, M.: Learning fast and robust target models for video object segmentation. In: CVPR (2020)
Rother, C., Kolmogorov, V., Blake, A.: “grabcut’’: interactive foreground extraction using iterated graph cuts. Trans. Gr. 23(3), 9007 (2004)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. IJCV 115(3), 174 (2015)
Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: NeurIPS (2017)
Sabour, S., Tagliasacchi, A., Yazdani, S., Hinton, G.E., Fleet, D.J.: Unsupervised part representation by flow capsules. arXiv preprint arXiv:2011.13920 (2020)
Seong, H., Hyun, J., Kim, E.: A kernel-based approach for video object segmentation. In: CVPR Workshops (2020)
Shaban, A., Firl, A., Humayun, A., Yuan, J., Wang, X., Lei, P., Dhanda, N., Boots, B., Rehg, J.M., Li, F.: Multiple-instance video segmentation with sequence-specific object proposals. In: CVPR Workshops (2017)
Sharir, G., Smolyansky, E., Friedman, I.: Video object segmentation using tracked object proposals. In: CVPR Workshops (2017)
Sun, J., Yu, D., Li, Y., Wang, C.: Mask propagation network for video object segmentation. In: CVPR Workshops (2018)
Tran, M.-T., Hoang, T., Nguyen, T.V., Le, T.-N., Nguyen, E., Le, M., Nguyen-Dinh, H., Hoang, X., Do, M.N.: Multi-referenced guided instance segmentation framework for semi-supervised video instance segmentation. In: CVPR Workshops (2020)
Tran, M.-T., Le, T.-N., Nguyen, T.V., Ton-That, V., Hoang, T.-H., Bui, N.-M., Do, T.-L., Luong, Q.-A., Nguyen, V.-T., Duong, D.A., Do, M.N.: Guided instance segmentation framework for semi-supervised video instance segmentation. In: CVPR Workshops (2019)
Tran, M.-T., Ton-That, V., Le, T.-N., Nguyen, K.-T., Ninh, T.V., Le, T.-K., Nguyen, V.-T., Nguyen, T.V., Do, M.N.: Context-based instance segmentation in video sequences. In: CVPR Workshops (2018)
Tsai, Y.H., Yang, M.H., Black, M.J.: Video segmentation via object flow. In: CVPR (2016)
Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.-C.: Feelvos: Fast end-to-end embedding learning for video object segmentation. In: CVPR (2019)
Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for the 2017 davis challenge on video object segmentation. In: CVPR Workshops (2017)
Petrosyan, V., Örnsberg, O., Proutiere, A.: Video object segmentation via tracking edges and classifying segments. In: CVPR Workshops (2018)
Vu-Le, T., Nguyen-Le, H., Nguyen, E., Do, M.N., Tran, M.: Video object segmentation with memory augmentation and multi-pass approach. In: CVPR Workshops (2020)
Wang, B., Zheng, C., Wang, N., Wang, S., Zhang, X., Liu, S., Gao, S., Lu, K., Zhang, D., Shen, L., Wang, Y., Xu, Y.: Object-based spatial similarity for semi-supervised video object segmentation. In: CVPR Workshops (2019)
Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.: Fast online object tracking and segmentation: a unifying approach. In: CVPR (2019)
Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: Deepflow: Large displacement optical flow with deep matching. In: ICCV (2013)
Xiao, T., Li, S., Wang, B., Lin, L., Wang, X.: Joint detection and identification feature learning for person search. In: CVPR (2017)
Xie, H., Huang, Y., Xu, A., Lan, J., Sun, W.: Depth-aware space-time memory network for video object segmentation. In: CVPR Workshops (2020)
Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., Urtasun, R.: Upsnet: A unified panoptic segmentation network. In: CVPR (2019)
Xu, K., Wen, L., Li, G., Bo, L., Huang, Q.: Spatiotemporal cnn for video object segmentation. In: CVPR (2019)
Xu, N., Price, B., Cohen, S., Yang, J., Huang, T.: Deep grabcut for object selection. In: BMVC (2017)
Xu, S., Bao, L., Zhou, P.: Class-agnostic video object segmentation without semantic re-identification. In: CVPR Workshops (2018)
Xu, S., Liu, D., Bao, L., Liu, W., Zhou, P.: Mhp-vos: Multiple hypotheses propagation for video object segmentation. In: CVPR (2019)
Yan, J., Le, T.-N., Nguyen, K.-D., Tran, M.-T., Do, T.-T., Nguyen, T.V.: Mirrornet: bio-inspired camouflaged object segmentation. IEEE Access 9, 43290–43300 (2021)
Yang, J., Price, B., Cohen, S., Lee, H., Yang, M.H.: Object contour detection with a fully convolutional encoder-decoder network. In: CVPR (2016)
Yang, Z., Ding, Y., Wei, Y., Yang, Y.: Cfbi+: Collaborative video object segmentation by multi-scale foreground-background integration. In: CVPR Workshops (2020)
Zhang, P., Hu, L., Zhang, B., Pan, P.: Spatial constrained memory network for semi-supervised video object segmentation. In: CVPR Workshops (2020)
Zhang, W., Tang, P., Zhao, L.: Remote sensing image scene classification using cnn-capsnet. Remote Sens. 11(5), 494 (2019)
Zhao, H.: Some promising ideas about multi-instance video segmentation. In: CVPR Workshops (2017)
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. Trans. Pattern Anal. Mach. Intell. 5, 1700 (2017)
Acknowledgements
This research is funded by Gia Lam Urban Development and Investment Company Limited, Vingroup, supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19, and National Science Foundation (NSF) under Grant No. 2025234. The first author would like to thank JSPS KAKENHI Grants (JP16H06302, JP18H04120, JP21H04907, JP20K23355, JP21K18023) and JST CREST Grants (JPMJCR20D3, JPMJCR18A6). We also thank NVIDIA and AIOZ Pte Ltd for the support of GPU and computing infrastructure.
Cite this article
Le, TN., Nguyen, T.V. & Tran, MT. Contextual Guided Segmentation Framework for Semi-supervised Video Instance Segmentation. Machine Vision and Applications 33, 24 (2022). https://doi.org/10.1007/s00138-022-01278-x
DOI: https://doi.org/10.1007/s00138-022-01278-x