1 Introduction

Nowadays, video object segmentation is an important research area owing to its wide variety of applications, such as content-aware retargeting [5], content-based retrieval [3] and video surveillance, to name a few. Continuous improvements in video segmentation methods have been achieved in recent years [20]. Unsupervised video segmentation methods were proposed in [7, 18], which exploit appearance information and motion cues simultaneously. In [13, 33], object proposals are introduced as a preprocessing step to capture high-level features and obtain more robust segmentation results. However, unsupervised methods may fail in the complex situations encountered in different videos. In contrast, supervised methods [19, 28] rely on manually annotated labels for objects in some frames and design appearance models together with motion cues to achieve better segmentation quality. Nevertheless, supervised methods are hard to scale because of the need for manual intervention. Therefore, in order to achieve better video segmentation performance in an unsupervised manner, video co-segmentation methods [2, 4, 8–11, 16, 23, 25, 27, 34, 35] were introduced to compensate for the lack of supervision by exploiting additional information across multiple videos.

The aim of video object co-segmentation is to achieve better segmentation of each object by mining the similarity information among different videos. It can also be regarded as an extension of image co-segmentation [22, 24] and image co-saliency [12, 15, 32]. The video co-segmentation methods in [2, 23] enforce a cooperative constraint with a global appearance model for common objects. In [9, 10], the constraints of trajectory co-saliency and coherent motion of the foreground local parts were introduced into video co-segmentation, respectively. A multi-class video co-segmentation method based on a nonparametric Bayesian model was proposed in [4]. This model uses a video segmentation prior as well as a global appearance model that groups dense image patches into segments, which can potentially yield noisy results. In [25], a spatiotemporal SIFT flow was designed to generate estimations of common objects. Then, intra-frame saliency, inter-frame consistency and the estimations of common objects were incorporated into an energy optimization framework to obtain the common object regions.

Recently, object-based methods were introduced in [8, 16, 29, 34], which discover the common objects by mining the consistency information among segmentation proposals. Specifically, bounding boxes are used in [29] to initialize a set of easy-instance clusters for an input video, and an iterative growing process is then exploited to detect the harder instances throughout the entire video for each cluster. However, this method only focuses on segmenting objects in a single video. The method in [16] builds a probabilistic graphical model across a set of videos based on object proposals in each frame. An energy function incorporating the appearance, spatial and temporal consistency of primary objects is then solved to obtain the common objects. The method in [8] formulates a multi-state selection graph in which each proposal is a state of each frame node. Object score, region overlapping, intra-video and inter-video coherence are considered in the graph to jointly optimize the segmentation of different objects. However, this method suffers from the restrictive assumption that each video contains the same number of object classes, and it cannot deal with the situation where the common object does not appear in some videos. In [34], object proposals are used to form a number of tracklets, and each tracklet is treated as a node of an undirected graph; inter-video and intra-video constraints are considered to set an edge connecting each pair of nodes, and edges with similarity lower than a manually set threshold are removed; finally, the multi-class object extraction problem is solved by obtaining the regularized maximum weight cliques iteratively. However, [34] requires a manual setting to indicate the similarity of common objects across all videos, which makes the method difficult to extend to practical applications and may cause failures when the common objects show large appearance differences among videos.

As an improvement on [8], the method in [35] uses object clusters to construct a weighted graph, which is employed to highlight the common object. In [11], a co-saliency based segmentation scheme is built on superpixels and employed to find the co-salient object regions among videos. In [27], video co-segmentation is formulated as the minimization of an energy function that incorporates a co-saliency term, an intra-video appearance term, an inter-video appearance term and a spatiotemporal smoothness term. However, since these methods [11, 27, 35] rely on co-saliency maps to segment the common objects, their common limitation is the inability to identify different classes of objects in the video set.

Therefore, in this paper, we propose a novel video co-segmentation method, which aims at more effective and automatic co-segmentation of multi-class objects. Our main contribution lies in the following two aspects. First, a novel directed graph is designed to effectively connect object tracklets via tracklet-wise co-saliency maps. In [34], objects from different videos with a similarity larger than a threshold are considered as common objects. Different from [34], in our method two salient objects that are in different videos and point to each other via the directed edges are treated as belonging to the same class. This is a clear advantage over [34], since the proposed directed graph avoids the fine-tuning of the object-similarity threshold required in [34]. Besides, different from the co-saliency based methods [11, 27, 35], our tracklet-wise co-saliency maps are used to connect similar object tracklets and can deal with the segmentation of multi-class objects. Second, we propose to perform video object co-segmentation via a new pipeline, which first extracts common object tracklets by solving a maximum weight clique problem, then generates object-level saliency maps by manifold ranking [31], and finally exploits GrabCut [21] to obtain the object co-segmentation results.

Different from the method in [29], which focuses on segmenting objects in a single video, our method aims at co-segmenting objects in multiple videos. In [29], after the iterative growing, the obtained harder instances are used as weak supervision to segment foreground objects in each video frame. In contrast, our method first uses the obtained tracklets to generate tracklet-wise co-saliency maps by combining the initial saliency maps and the tracklet-wise similarity maps, where the tracklet-wise similarity maps are generated based on the similarity between tracklets in different videos. Then, we construct the directed graph by connecting tracklets across videos based on the tracklet-wise co-saliency maps, and the maximum weight clique (MWC) is extracted from the graph for the final segmentation.

The rest of this paper is organized as follows. Section 2 details the proposed video co-segmentation method illustrated in Fig. 1. Experimental results and analysis are presented in Section 3, and conclusions are given in Section 4.

Fig. 1

Overview of the proposed video co-segmentation method. (a) Input videos; (b) object proposals; (c) object tracklets; (d) tracklet-wise co-saliency maps; (e) graph construction; (f) object-level saliency maps; (g) object co-segmentation results

2 Proposed video co-segmentation method

2.1 Preprocessing

Given a set of videos \( {\left\{{V}_m\right\}}_{m=1}^M \), where each video Vm contains a set of frames \( {\left\{{F}_{m,t}\right\}}_{t=1}^{N_m} \), we first transform each frame Fm, t into the Lab color space, which correlates better with human perception. For each frame Fm, t, each color channel is uniformly quantized into 16 bins for generating the color histogram. Then the category-independent object proposal algorithm [6] is used to generate a set of candidate regions as object proposals \( {\left\{{x}_{m,t,i}\right\}}_{i=1}^q \) for each frame Fm, t. Let Ha denote the color histogram calculated for region Ra; the color similarity between any pair of regions, Ra and Rb, is defined as follows:

$$ Sim\left({R}_a,{R}_b\right)=1-{\chi}^2\left[{H}_a,{H}_b\right] $$
(1)

where χ2[⋅] denotes the chi-square distance between histograms. For the videos shown in Fig. 1(a), the object proposals generated for the video frames are shown in Fig. 1(b).
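As a concrete illustration of this preprocessing step, the following sketch builds a quantized Lab color histogram with 16 bins per channel and computes the region similarity of Eq. (1). It is a minimal example under the assumption that the Lab values are stored as 8-bit channels; the function names are ours, not taken from the authors' implementation.

```python
import numpy as np

def color_histogram(lab_pixels, bins=16):
    """Quantize each Lab channel into `bins` bins and build a joint histogram.

    lab_pixels: (N, 3) array of Lab values for the pixels of one region,
    assumed to be in the 8-bit range [0, 255].
    Returns a normalized histogram of length bins**3.
    """
    idx = np.clip((lab_pixels / 256.0 * bins).astype(int), 0, bins - 1)
    joint = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
    hist = np.bincount(joint, minlength=bins ** 3).astype(float)
    return hist / max(hist.sum(), 1e-12)

def chi_square_distance(h_a, h_b):
    """Chi-square distance between two normalized histograms, bounded in [0, 1]."""
    denom = h_a + h_b
    mask = denom > 0
    return 0.5 * np.sum((h_a[mask] - h_b[mask]) ** 2 / denom[mask])

def region_similarity(h_a, h_b):
    """Eq. (1): Sim(R_a, R_b) = 1 - chi2[H_a, H_b]."""
    return 1.0 - chi_square_distance(h_a, h_b)
```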

2.2 Tracklet generation

A number of proposals have been generated for each frame, and each proposal may correspond to a real object region, a background region, or a region containing parts of both an object and the background. In order to evaluate the likelihood of each proposal belonging to a real object, we first obtain the initial saliency map ISm, t for each frame Fm, t by using the intra-saliency map generation step in [30], which combines spatial saliency, temporal saliency and an object prior for a maximal preservation of salient objects in terms of both appearance and motion. Specifically, the initial saliency map is defined as follows:

$$ I{S}_{m,t}(p)=\left[S{S}_{m,t}(p)+S{T}_{m,t}(p)\right]\cdot O{P}_{m,t}(p) $$
(2)

where p denotes each pixel in the video frame. SSm, t(p) denotes the spatial saliency, which is computed based on color contrast and spatial sparsity within the frame, and STm, t(p) denotes the temporal saliency, which is generated based on motion distinctiveness from the background and temporal coherence over a period of consecutive frames. Besides, OPm, t(p) denotes the object prior, which is calculated based on the observation that salient object regions touch the image borders less than background regions; such a prior can effectively highlight salient object regions and suppress background regions. Then for each proposal xm, t, i, the object score is calculated as follows:

$$ IOS\left({x}_{m,t,i}\right)=\frac{\sum_{p\in {x}_{m,t,i}}I{S}_{m,t}(p)}{\sum_{p\in {F}_{m,t}}I{S}_{m,t}(p)}\cdot \frac{\sum_{p\in {x}_{m,t,i}}I{S}_{m,t}(p)}{\mid {x}_{m,t,i}\mid } $$
(3)

where p denotes each pixel in the video frame and ∣xm, t, i∣ denotes the number of pixels in the proposal xm, t, i. According to Eq. (3), a proposal with a higher object score covers a larger portion of the object and a smaller portion of the background. We then track each proposal xm, t, i backward and forward through the whole video to form the corresponding track Xm, t, i.
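The following minimal sketch shows how Eqs. (2) and (3) can be evaluated once the spatial saliency, temporal saliency and object prior maps of [30] are available as arrays; it does not reproduce how those maps themselves are computed, and the function names are illustrative.

```python
import numpy as np

def initial_saliency(spatial_sal, temporal_sal, object_prior):
    """Eq. (2): IS = (SS + ST) * OP, computed pixel-wise on H x W arrays."""
    return (spatial_sal + temporal_sal) * object_prior

def proposal_object_score(initial_sal, proposal_mask):
    """Eq. (3): product of the fraction of frame saliency captured by the
    proposal and the average saliency inside the proposal."""
    total = initial_sal.sum() + 1e-12
    inside = initial_sal[proposal_mask].sum()
    coverage = inside / total                       # saliency captured by the proposal
    density = inside / max(proposal_mask.sum(), 1)  # average saliency per proposal pixel
    return coverage * density
```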

For tracking object proposals, a similarity function combining color and location between any pair of proposals extracted from adjacent frames is defined as follows:

$$ S\left({x}_{m,t,i},{x}_{m,u,j}\right)= Sim\left({x}_{m,t,i},{x}_{m,u,j}\right)\cdot \frac{\mid {x}_{m,t,i}\cap war{p}_{u,t}\left({x}_{m,u,j}\right)\mid }{\mid {x}_{m,t,i}\cup war{p}_{u,t}\left({x}_{m,u,j}\right)\mid } $$
(4)

where warpu, t(xm, u, j) denotes the region obtained by warping proposal j of frame u to frame t using optical flow [14]. For example, based on proposal i in frame t, we can find the most similar proposal j in frame t + 1 according to Eq. (4); and then based on proposal j, we can find the most similar proposal l in frame t + 2. This process is performed iteratively over consecutive frames, and similarly in the temporally backward direction. In this way, we generate a track Xm, t, i spanning the whole video for the proposal xm, t, i. Fig. 2 shows two examples of tracking proposals to generate the corresponding tracks. This process generates a large number of tracks, since each object proposal corresponds to a track. Most of the generated tracks overlap heavily and are therefore redundant. A non-maximum suppression process is then performed to remove the redundant tracks for each video. Specifically, we first calculate the object score for each track as follows:

Fig. 2

Illustration of track generation. Only two examples of tracking proposals backward and forward to generate the corresponding tracks are shown. Each arrow points to the most similar proposal in the next or previous frame

$$ ITS\left({X}_{m,t,i}\right)={\sum}_{x\in {X}_{m,t,i}} IOS(x) $$
(5)

Based on Eq. (5), the track with the highest score is selected as a reference track XR. Then we calculate the overlap ratio for all the other tracks Y = {Y1, ..., YNT} with respect to the reference track XR as follows:

$$ O\left({X}^R,{Y}^l\right)=\frac{\sum \limits_{t=1}^{N_m}{\sum}_{x_{m,t,i}\in {X}^R,{y}_{m,t,j}\in {Y}^l}\mid {x}_{m,t,i}\cap {y}_{m,t,j}\mid }{\sum \limits_{t=1}^{N_m}{\sum}_{x_{m,t,i}\in {X}^R,{y}_{m,t,j}\in {Y}^l}\mid {x}_{m,t,i}\cup {y}_{m,t,j}\mid } $$
(6)

Based on Eq. (6), the tracks whose overlap ratio with XR is larger than 0.5 are removed. From the remaining tracks, the one with the second highest score is selected as the next reference track and the above selection operation is performed again; this process continues until each of the remaining tracks has been used as the reference track. Each remaining track is then split into tracklets to ensure the consistency of object proposals. Specifically, the similarity between two proposals in adjacent frames is used to determine whether the track should be split between these two proposals. For each track, the mean of such similarity measures is calculated; if the similarity between two adjacent proposals deviates from the mean by more than 1.5 times the standard deviation, the track is split between the two proposals in the adjacent frames, as shown in Fig. 3. Using the above split operation, we obtain the tracklet set \( X={\left\{{X}_{m,k}\right\}}_{m=1,...,M;k=1,...,{K}_m} \) for the whole video set \( {\left\{{V}_m\right\}}_{m=1}^M \), where Km is the number of final tracklets in Vm. Examples of the generated tracklets are shown in Fig. 1(c).
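To make the track post-processing concrete, the sketch below implements the track scoring of Eq. (5), the overlap-based non-maximum suppression of Eq. (6) with the 0.5 threshold, and the 1.5-standard-deviation splitting rule described above. Tracks are represented here as lists of (frame index, boolean proposal mask) pairs; this representation and the helper names are assumptions of the sketch, and the proposal tracking of Eq. (4), which requires optical flow, is taken as already done.

```python
import numpy as np

def track_score(proposal_scores):
    """Eq. (5): sum of the proposal object scores (Eq. (3)) along one track."""
    return float(np.sum(proposal_scores))

def track_overlap(track_a, track_b):
    """Eq. (6): summed intersection over summed union of the proposals
    that the two tracks contain in the same frames."""
    masks_b = dict(track_b)
    inter = union = 0
    for t, mask_a in track_a:
        if t in masks_b:
            inter += np.logical_and(mask_a, masks_b[t]).sum()
            union += np.logical_or(mask_a, masks_b[t]).sum()
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(tracks, scores, overlap_thresh=0.5):
    """Greedily take reference tracks in descending score order and remove
    every other track whose overlap with the reference exceeds the threshold."""
    order = list(np.argsort(scores)[::-1])
    removed, kept = set(), []
    for i in order:
        if i in removed:
            continue
        kept.append(tracks[i])
        for j in order:
            if j != i and j not in removed and \
                    track_overlap(tracks[i], tracks[j]) > overlap_thresh:
                removed.add(j)
    return kept

def split_track(track, adjacent_similarities, k=1.5):
    """Split a track wherever the similarity between adjacent proposals deviates
    from the track's mean similarity by more than k standard deviations."""
    sims = np.asarray(adjacent_similarities, dtype=float)
    mu, sigma = sims.mean(), sims.std()
    tracklets, start = [], 0
    for idx, s in enumerate(sims):
        if abs(s - mu) > k * sigma:  # split between proposals idx and idx + 1
            tracklets.append(track[start:idx + 1])
            start = idx + 1
    tracklets.append(track[start:])
    return tracklets
```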

Fig. 3

Illustration of splitting one remaining track into tracklets. In this example, splits occur at two positions (marked with scissors): one between frames 68 and 69, and the other between frames 69 and 70

Algorithm 1 Pseudo Code of Similarity Maps Generation


2.3 Graph construction

A large number of tracklets have been obtained in Section 2.2, and each tracklet represents some kind of object or background. Video co-segmentation aims at finding the relationships among videos and extracting multi-class objects from all videos. For this purpose, we propose a directed graph, which connects tracklets belonging to the same class and extracts the salient ones as objects. We use G = (V, E, W) to denote the directed graph, where V is the set of vertices, E is the set of directed edges, and W is the set of vertex weights. Each vertex vm, k ∈ V represents a tracklet Xm, k, and e = (vm, k, vn, l) denotes the edge directed from vm, k to vn, l.

Based on this directed graph, we try to connect each tracklet with its best-matched tracklet among all tracklets in all videos. Here we make the assumption that each tracklet Xm, k tends to connect to the best-matched tracklet that contains a complete and similar object. This means that a tracklet containing only a part of an object may connect to a tracklet containing a complete and similar object, while two tracklets containing complete and similar objects tend to connect with each other. For this purpose, the matching score of one tracklet Xm, k with respect to another tracklet Xn, l is defined as follows:

$$ FT{S}_{m,k}\left({X}_{n,l}\right)= Sim\left({X}_{m,k},{X}_{n,l}\right)\cdot \left\{{\sum}_{x_{n,t,i}\in {X}_{n,l}}\left(\frac{\sum_{p\in {x}_{n,t,i}} TC{S}_{m,k}^{n,t}(p)}{\sum_{p\in {F}_{n,t}} TC{S}_{m,k}^{n,t}(p)}\cdot \frac{\sum_{p\in {x}_{n,t,i}} TC{S}_{m,k}^{n,t}(p)}{\mid {x}_{n,t,i}\mid}\right)\right\} $$
(7)

In Eq. (7), the tracklet-level similarity Sim(Xm, k, Xn, l) takes the form of Eq. (1), in which the two color histograms are calculated based on the pixels of all proposals in Xm, k and Xn, l respectively. The tracklet-wise co-saliency map of Xm, k with respect to the frame Fn, t is defined as follows:

$$ TC{S}_{m,k}^{n,t}(p)=S{M}_{m,k}^{n,t}(p)\cdot I{S}_{n,t}(p) $$
(8)

where \( S{M}_{m,k}^{n,t} \) is the tracklet-wise similarity map of Xm, k with respect to Fn, t. Eq. (8) indicates that the tracklet-wise co-saliency map is generated by pixel-wise multiplication of the initial saliency map and the tracklet-wise similarity map.

The tracklet-wise co-saliency map \( TC{S}_{m,k}^{n,t} \) highlights the regions that are salient in the frame Fn, t and similar to the proposals in the tracklet Xm, k. As shown in Fig. 4, the initial saliency maps highlight both objects, i.e., the giraffe and the elephant, based on appearance and motion, while the similarity maps of the tracklet corresponding to the giraffe highlight only the regions of the giraffe. By combining the initial saliency maps with the similarity maps, the obtained tracklet-wise co-saliency maps can better highlight the regions of the giraffe and suppress the regions irrelevant to the tracklet. It can be observed in Fig. 4 that the similarity map is crucial for generating the tracklet-wise co-saliency map. The generation process of the similarity map for each tracklet is described in Algorithm 1.
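A minimal sketch of Eqs. (7) and (8) is given below, assuming that the tracklet-wise similarity maps produced by Algorithm 1 and the initial saliency maps of Eq. (2) are available as arrays of the frame size; the tracklet-level similarity is the histogram similarity of Eq. (1) computed over all proposal pixels, and the function names are illustrative.

```python
import numpy as np

def cosaliency_map(similarity_map, initial_sal):
    """Eq. (8): pixel-wise product of the tracklet-wise similarity map of
    X_{m,k} w.r.t. frame F_{n,t} and the initial saliency map of F_{n,t}."""
    return similarity_map * initial_sal

def matching_score(tracklet_similarity, cosal_maps, proposal_masks):
    """Eq. (7): tracklet-level color similarity Sim(X_{m,k}, X_{n,l}) times the
    sum, over the proposals of X_{n,l}, of (saliency coverage) * (saliency
    density) measured on the co-saliency maps of X_{m,k}.

    cosal_maps[i]     : co-saliency map TCS of X_{m,k} for the frame containing
                        the i-th proposal of X_{n,l}.
    proposal_masks[i] : boolean mask of that i-th proposal.
    """
    acc = 0.0
    for tcs, mask in zip(cosal_maps, proposal_masks):
        total = tcs.sum() + 1e-12
        inside = tcs[mask].sum()
        acc += (inside / total) * (inside / max(mask.sum(), 1))
    return tracklet_similarity * acc
```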

Fig. 4

Illustration of generating the tracklet-wise co-saliency maps. (a) One tracklet; (b) original frames; (c) initial saliency maps; (d) similarity maps; (e) tracklet-wise co-saliency maps

To intuitively demonstrate the effect of the tracklet-wise co-saliency map, a comparison is shown in Fig. 5. Figure 5(a) shows the connection of two tracklets using Eq. (7) based on the initial saliency map. As shown in Fig. 5(a), since the initial saliency maps highlight the regions of both the chicken and the turtle, the matching score of ‘chicken1’ with respect to ‘chicken2&turtle’ is larger than that with respect to ‘chicken2’, which results in a wrong matching. In contrast, as shown in Fig. 5(b), since the tracklet-wise co-saliency maps of ‘chicken1’ only highlight the region of the chicken, the matching score of ‘chicken1’ with respect to ‘chicken2’ is larger, which leads to the correct matching.

Fig. 5

Illustration of the difference between using the initial saliency map and the tracklet-wise co-saliency map in the directed graph construction. (a) Tracklet matching process using initial saliency maps; (b) tracklet matching process using tracklet-wise co-saliency maps. The red arrow in (a) and (b) denotes the best-matched tracklet

By using Eqs. (7)–(8), we obtain the matching scores of each tracklet Xm, k with respect to all the other tracklets. The matching scores are exploited to find the best-matched tracklet for Xm, k. For each tracklet Xm, k, we search over all videos for the tracklet Xn, l with the highest matching score FTSm, k(Xn, l). Specifically, in the video Vm in which Xm, k exists, we only consider as candidates for the best-matched tracklet those tracklets Xn, l with Sim(Xm, k, Xn, l) > 0.9. Then a connection is added from Xm, k to Xn, l with a directed edge. The latter constraint with a high similarity threshold, 0.9, is exploited to discard tracklets in those video frames where similar objects are unlikely to appear.
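The edge construction described above can be sketched as follows: every tracklet is linked to the candidate with the highest matching score of Eq. (7), and candidates from the tracklet's own video are admitted only when the tracklet-level color similarity exceeds 0.9. The data structures are illustrative, and the exact scope of the 0.9 gate follows our reading of the description above.

```python
def build_directed_graph(tracklets, matching_score_fn, color_sim_fn,
                         intra_sim_thresh=0.9):
    """Return directed edges as a dict {source_index: best_match_index}.

    tracklets         : list of (video_id, tracklet) pairs.
    matching_score_fn : FTS of Eq. (7) between two tracklets.
    color_sim_fn      : tracklet-level Sim of Eq. (1).
    """
    edges = {}
    for i, (vid_i, trk_i) in enumerate(tracklets):
        best_j, best_score = None, float("-inf")
        for j, (vid_j, trk_j) in enumerate(tracklets):
            if i == j:
                continue
            # Candidates from the same video must also be highly similar in color.
            if vid_i == vid_j and color_sim_fn(trk_i, trk_j) <= intra_sim_thresh:
                continue
            score = matching_score_fn(trk_i, trk_j)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            edges[i] = best_j  # directed edge from tracklet i to its best match
    return edges
```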

Figure 1(e) illustrates the graph construction. It can be seen that, based on the matching score function, a tracklet of ‘chicken1’ points to the tracklet of the same object class, ‘chicken2’, in video 2 instead of ‘chicken2&turtle’ or ‘turtle’. Similarly, a tracklet of ‘chicken2’ points to the same object class ‘chicken1’ instead of ‘part of chicken’ or ‘chicken2&turtle’. This indicates that the tracklets of complete and similar objects point to each other, and two such objects pointing to each other via the directed edges can be treated as belonging to the same class.

2.4 Common object extraction

After connecting each tracklet with its best-matched tracklet, we transform the directed graph G = (V, E, W) into an undirected graph G' = (V, E', W). Based on the assumption that tracklets pointing to each other contain complete and similar objects, any pair of tracklets is connected with an undirected edge if they are connected to each other by two directed edges. The pairwise connected tracklets in G' belong to a complete subgraph. Thus, we can regard the extraction of common object tracklets as a Maximum Weight Clique (MWC) problem, since the tracklets corresponding to the primary objects have high object scores. We use the Bron-Kerbosch algorithm [1] to find all maximal cliques Ch (h = 1, ..., H) in G'. The weight of each clique is defined as follows:

$$ W\left({C}_h\right)={\sum}_{X_{m,k}\in {C}_h} ITS\left({X}_{m,k}\right) $$
(9)

Then the MWC from all cliques is obtained as follows:

$$ MWC=\arg \underset{C_h}{\max}\left[W\left({C}_h\right)\right] $$
(10)
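The clique extraction can be sketched with networkx, whose find_cliques routine enumerates maximal cliques with the Bron-Kerbosch algorithm [1]; the tracklet weights are the track scores of Eq. (5). This is a simplified illustration rather than the authors' implementation.

```python
import networkx as nx

def extract_mwc(directed_edges, tracklet_scores):
    """Build the undirected graph from mutually linked tracklets and return
    the maximum weight clique (Eqs. (9)-(10)).

    directed_edges  : dict {i: j} mapping each tracklet to its best match.
    tracklet_scores : dict {i: ITS_i} with the track scores of Eq. (5).
    """
    g = nx.Graph()
    g.add_nodes_from(tracklet_scores)
    # Add an undirected edge only when two tracklets point to each other.
    for i, j in directed_edges.items():
        if directed_edges.get(j) == i:
            g.add_edge(i, j)
    # Enumerate maximal cliques (Bron-Kerbosch) and keep the heaviest one.
    cliques = list(nx.find_cliques(g))
    return max(cliques, key=lambda c: sum(tracklet_scores[i] for i in c))
```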

The proposals in the MWC are regarded as object queries to compute object-level saliency maps by using the manifold ranking algorithm [31]. Then, for each frame Fm, t, we threshold the object-level saliency map adaptively by using Otsu's method [17]. The obtained background pixels are used to estimate the background Gaussian mixture model \( GM{M}_{m,t}^b \). The object Gaussian mixture model \( GM{M}_m^f \) is estimated for each video Vm based on the proposals that belong to both Vm and the MWC. Finally, GrabCut [21] based on \( GM{M}_m^f \) and \( GM{M}_{m,t}^b \) is used to obtain the co-segmentation result of the primary object class for each video frame Fm, t. For the extraction of the second object class, we set the pixels of the extracted primary object regions in the initial saliency maps to zero. Then we update the graph by recalculating the object scores of proposals and tracklets using Eqs. (3) and (5) as well as the weights of the cliques using Eq. (9). The objects of the second class are then extracted as a new MWC based on Eq. (10). The above process can be performed multiple times in order to extract multiple classes of objects from the video set. Examples of the final co-segmentation results are shown in Fig. 1(g), in which multi-class objects are represented by different colors.
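The per-frame segmentation step can be approximated as in the sketch below using OpenCV. Note that cv2.grabCut estimates its own color GMMs from the initial mask, whereas the method above estimates a per-video foreground GMM and a per-frame background GMM explicitly, so this is only an approximation of the described pipeline; the manifold-ranking saliency map and the MWC proposal mask are assumed to be given.

```python
import cv2
import numpy as np

def segment_frame(frame_bgr, object_saliency, mwc_proposal_mask, iterations=5):
    """Approximate per-frame segmentation with OpenCV's GrabCut.

    frame_bgr         : H x W x 3 uint8 frame.
    object_saliency   : H x W object-level saliency map in [0, 1] ([31]).
    mwc_proposal_mask : boolean mask of the MWC proposal in this frame.
    """
    # Otsu's method [17] on the saliency map gives an adaptive foreground estimate.
    sal_u8 = (np.clip(object_saliency, 0, 1) * 255).astype(np.uint8)
    _, fg_estimate = cv2.threshold(sal_u8, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Build the GrabCut trimap: Otsu foreground is probable foreground,
    # the MWC proposal (object query) is treated as certain foreground here,
    # and the rest as probable background -- an assumption of this sketch.
    mask = np.full(frame_bgr.shape[:2], cv2.GC_PR_BGD, np.uint8)
    mask[fg_estimate > 0] = cv2.GC_PR_FGD
    mask[mwc_proposal_mask] = cv2.GC_FGD

    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame_bgr, mask, None, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```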

3 Experimental results

We performed experiments on two public video datasets, MOViCS [4] and Safari [34]. The experimental settings and the objective and subjective evaluations, including comparisons with two state-of-the-art video co-segmentation methods on MOViCS and Safari, are presented in Sections 3.1 and 3.2, respectively, and a comparison with a common video segmentation method is presented in Section 3.3. The computation cost of the video co-segmentation methods is discussed in Section 3.4.

3.1 Experiments on MOViCS dataset

The proposed video co-segmentation method is first evaluated on the public MOViCS dataset [4], which contains a total of 11 videos grouped into 4 video sets. Each video set contains one or two object classes. Originally, manually annotated ground truths are provided for only 5 frames in each video. For a more comprehensive comparison, we extended the ground truths by uniformly sampling 20 frames in each video and manually annotating the pixel-level ground truths of all objects in these frames. The proposed method is compared with two state-of-the-art video co-segmentation methods, MVC [4] and RMWC [34], which can extract multi-class objects. We tested both MVC [4] and RMWC [34] with their publicly available codes. For RMWC [34], we used the best thresholds suggested in [34] for the four video sets, and for MVC [4], we used the default parameter setting in the code.

The multi-class object co-segmentation results generated using our method and the other two methods are shown in Fig. 6 for an intuitive comparison. It can be observed that the results of MVC cannot provide a distinctive segmentation of real objects. Moreover, MVC tends to break one object into a number of fragments. RMWC fails to segment the common object into one class in the video set ‘Tiger’. Note that RMWC uses manually set thresholds for the similarity measures between common objects, which may result in errors when the common objects show large appearance differences. Compared with the other two methods, our method can correctly segment each object class and obtains more complete segmentation results by using the directed graph, which avoids the manual parameter setting for the similarity measures between common objects.

Fig. 6

Video co-segmentation results on the MOViCS dataset. From left to right: original video frames, ground truths, co-segmentation results generated using the proposed method, RMWC [34] and MVC [4], respectively. From the 2nd to the 4th column, red regions correspond to the objects of the primary class and green regions correspond to the objects of the second class

Fig. 7 shows the effectiveness of our method in dealing with the absence of target objects in some frames or in the whole video. Fig. 7(a) shows the appearance, occlusion and disappearance of the lion and the zebra in some frames, and Fig. 7(b) shows the absence of the lion in the whole video. The segmentation results demonstrate that our method can deal with the appearance and disappearance of objects in some frames or in the whole video.

Fig. 7

Experimental results on the video set ‘Zebra&lion’ with two situations: (a) the appearance, occlusion and disappearance of the lion and the zebra in some frames; (b) the absence of the lion in the whole video

To objectively evaluate the video co-segmentation performance, following the setup in [4], we use the intersection-over-union metric to quantify the results as follows:

$$ M\left(S,G\right)=\frac{\mid S\cap G\mid }{\mid S\cup G\mid } $$
(11)

where S is a set of segments and G is the ground truth. The co-segmentation score of one object class in the video set is defined as follows:

$$ Scor{e}_j=\underset{S_i}{\max}\left[M\left({S}_i,{G}_j\right)\right] $$
(12)

where Si denotes all segments in the object class i, and Gj is the ground truth for the object class j. The final co-segmentation score for the video set is defined as the average score on all object classes as follows:

$$ Score=\frac{1}{C}{\sum}_j Scor{e}_j $$
(13)

where C is the number of object classes in the ground truth. Table 1 shows the quantitative comparison of our method with the other two methods. It can be seen that our method outperforms the other two methods on 3 out of 4 video sets. The only video set on which our method has a lower score is ‘Giraffe&elephant’, because the segments of the giraffe are not complete enough. On average, our method achieves the best performance on the whole dataset, which demonstrates the advantage of our directed graph based video co-segmentation.
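For reference, the scores of Eqs. (11)–(13) reported in Table 1 can be computed as in the following sketch, where each segmented class and each ground-truth class is a boolean mask aggregated over the evaluated frames; the function names are illustrative.

```python
import numpy as np

def iou(segment_mask, gt_mask):
    """Eq. (11): intersection-over-union between a segment and a ground truth."""
    inter = np.logical_and(segment_mask, gt_mask).sum()
    union = np.logical_or(segment_mask, gt_mask).sum()
    return inter / union if union > 0 else 0.0

def video_set_score(segmented_classes, gt_classes):
    """Eqs. (12)-(13): each ground-truth class takes the best-matching segmented
    class (maximum IoU), and the final score averages over all classes."""
    per_class = [max(iou(s, g) for s in segmented_classes) for g in gt_classes]
    return float(np.mean(per_class))
```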

Table 1 Quantitative evaluation on the MOViCS dataset

3.2 Experiments on safari dataset

We also evaluated the proposed method on the Safari dataset [34], which contains 9 videos with 5 classes of animals. The proposed method is again compared with the two state-of-the-art methods, MVC [4] and RMWC [34]. For both MVC and RMWC, we used the default parameter settings in the codes provided by their authors.

We show the visual comparisons between our method and the other two methods in Fig. 8. It can be seen that for different classes of objects with dissimilar appearances, such as the elephants in the video set ‘Elephant&sheep&giraffe&lion’, our method achieves better segmentation results than the other two methods. Our method does not perform well on the video set ‘Sheep&elephant’, in which some background regions have colors very similar to the sheep and the elephant; however, the other two methods also perform poorly on this video set. Table 2 shows the quantitative comparison of our method with the other two methods. Our method consistently outperforms the other two methods on all the four video sets, owing to the effectiveness and superiority of the proposed directed graph based video co-segmentation method.

Fig. 8

Video co-segmentation results on the Safari dataset. From left to right: original video frames, ground truths, co-segmentation results generated using the proposed method, RMWC [34] and MVC [4], respectively. From the 2nd to the 4th column, different colors indicate different object classes

Table 2 Quantitative evaluation on the Safari dataset

3.3 Comparison with the common video segmentation method

Following the settings in [34], we compared our method with a state-of-the-art video segmentation method, SAGS [26], which aims at segmenting objects in a single video. We performed segmentation on both datasets, i.e., MOViCS and Safari, and some segmentation results are shown in Figs. 9 and 10, respectively, for a visual comparison. It can be seen that our method can identify multi-class objects by incorporating information from other videos, such as the chicken and the turtle in the video ‘chicken_on_turtle’. In contrast, since both the chicken and the turtle are moving objects in this video, SAGS treats them as a single object and segments them together. Besides, in some low-contrast videos, such as ‘tiger1_all8’, ‘tiger1_all9’ and ‘tiger1_all10’, SAGS fails to segment the tigers, but our method is able to segment them by incorporating the information of these three videos simultaneously. In addition, we adopted Eq. (13) to evaluate the quality of the segmentation results generated using our method and SAGS, respectively, for each object class in each video; the quantitative comparisons on the MOViCS dataset and the Safari dataset are shown in Tables 3 and 4, respectively. It can be seen that our method outperforms SAGS on most of the videos, which clearly demonstrates the effectiveness of our method.

Fig. 9

Comparison between our method and SAGS [26] on the MOViCS dataset. For each video, from left to right: original video frames, ground truths, segmentation results generated using the proposed method and SAGS, respectively. From the 2nd to the 3rd column, different colors indicate different object classes

Fig. 10

Comparison between our method and SAGS [26] on the Safari dataset. For each video, from left to right: original video frames, ground truths, segmentation results generated using the proposed method and SAGS, respectively. From the 2nd to the 3rd column, different colors indicate different object classes

Table 3 Quantitative comparison between our method and SAGS [26] on the MOViCS dataset
Table 4 Quantitative Comparison between our method and SAGS [26] on the Safari dataset

3.4 Computation cost

We analyze the computation cost of the video co-segmentation methods in this subsection. All experiments were performed on a PC with an Intel Core i7-3770 3.4 GHz CPU and 16 GB RAM. Table 5 reports the average processing time per frame of each method on MOViCS videos with a resolution of 640 × 360. The average processing time per frame taken by the MATLAB implementation of our method is 61.93 s. It can be seen from Table 5 that our method has a higher computation cost than MVC. The reason is that our method employs the object proposal generation algorithm, which is time-consuming, while MVC uses superpixels for co-segmentation. However, our method has a lower computation cost than RMWC.

Table 5 Comparison of average processing time per frame taken by different video co-segmentation methods

Therefore, in order to make our method more practical for applications with runtime requirements, the computational efficiency of the object proposal generation algorithm, which is the runtime bottleneck, should be improved with the highest priority. With a GPU-accelerated implementation of the object proposal generation algorithm and an optimized C/C++ implementation of the other components of our method, we believe that the computation of our method can be substantially accelerated.

4 Conclusion

This paper proposes a novel video co-segmentation method, which enables an effective and automatic co-segmentation of multi-class objects in a set of videos via the proposed directed graph and a new segmentation pipeline. Specifically, a novel directed graph is first constructed to connect similar object tracklets, which are generated from object proposals, and tracklet-wise co-saliency maps are generated to make the matching scores more reliable, so that video co-segmentation can be treated as an MWC problem. Then, based on the extracted MWC, object-level saliency maps are generated by manifold ranking, and GrabCut is further exploited to obtain the final co-segmentation results. Experimental results show that the proposed method consistently outperforms the state-of-the-art video co-segmentation methods on both datasets.