1 Introduction

Video object segmentation is the process of extracting the object of interest from an entire video sequence, which is a critical step in various computer vision applications, such as video surveillance, behavioral understanding, activity recognition, video summarization and video retrieval. Broadly speaking, video object segmentation approaches can be divided into categories according to the degree of human intervention: manual segmentation, semi-automatic segmentation, and fully automatic segmentation. Manual segmentation approaches mainly depend on the operator's experience, and are normally time-consuming and laborious. Semi-automatic video object segmentation approaches [2, 32, 45] require an annotation of the target object in a key frame for initialization, and then propagate the annotation throughout the video using motion- and appearance-constrained optimization techniques. Although these semi-automatic approaches can normally provide promising segmentation results, most computer vision applications have to process a large amount of video data, and the cost of manually annotating video frames is particularly high. To tackle this problem, various fully automatic methods for video object segmentation have emerged. For instance, several fully automatic segmentation methods process each frame of a given video by adopting appearance and motion constraints to perform bottom-up segmentation [3, 20, 39].

A variety of other automatic segmentation methods have also been proposed, such as graph-based methods [12, 35], segmentation through clustering [5], binary partition trees [22], and so on. In recent years, deep learning techniques have shown promising performance in many applications, such as place recognition [44], dimension reduction [48], image ranking [43], and human pose recovery [18]. Many video object segmentation approaches based on deep learning have also been proposed. For instance, in [31], a method has been developed for object segmentation in videos using a convolutional neural network trained with static images only. In [4], Caelles et al. proposed a one-shot video object segmentation approach based on a fully convolutional neural network architecture. This approach is able to successively transfer generic semantic information to the task of foreground segmentation, and finally to learn the appearance of a single annotated object of the test sequence. In [29], an efficient video object segmentation approach has been presented based on a deep Siamese encoder-decoder network that is designed to take advantage of mask propagation and object detection while avoiding the weaknesses of both approaches. Although deep learning-based segmentation methods have demonstrated good results, the involved models often contain a large number of parameters that require dedicated training steps and a large amount of time to learn.

During the last decade, various video object segmentation approaches based on saliency detection models have been developed [13, 23, 25, 33, 34, 40, 41, 47], in which an explicit notion of what a foreground object looks like is formulated for the given video sequence. The basic idea behind this kind of approach is that we are normally interested only in some particular regions of a given video. These regions correspond to noticeable objects that most attract users' interest and best represent the content of the video. Thus, in these video object segmentation approaches, saliency detection models are used to locate object-like regions in each frame, and the problem of video object segmentation is recast as a problem of object region selection. The objectness (i.e., the likelihood that each pixel in the frame belongs to an object of interest) is measured using both motion- and appearance-based cues. A region of interest in a video frame is considered to be an object of interest if it has high objectness and high similarity across frames. Consequently, the obtained object of interest provides reliable prior information for the video object segmentation task. However, as indicated in [40], one major limitation of this kind of segmentation approach is that the dependency between the potential regions of interest extracted in adjacent frames is not taken into account, which can degrade the video segmentation performance. Moreover, it is very difficult to define a precise boundary between the object and the background in the saliency map.

Model-based segmentation algorithms have also been proposed in the past few decades and have continuously drawn considerable attention. One of the most representative model-based segmentation methods relies on mixture models [26]. Various mixture models have been applied to different problems, ranging from visual scene categorization [8], video background subtraction [9], and gene expression clustering [11], to image segmentation [10, 19]. The main advantage of mixture model-based approaches is that they can incorporate prior knowledge to model unknown uncertainties in a probabilistic manner. Although conventional mixture models (such as the Gaussian mixture model) are efficient for segmentation, they do not take the spatial information between neighbouring pixels into account, which makes the segmentation performance quite sensitive to noise. In recent years, mixture models based on Markov random fields (MRF) have received great attention in image segmentation [6, 15, 28]. In these approaches, in order to reduce the sensitivity of the segmentation to noise, the prior distribution of each pixel is related to the parameters of its neighboring pixels.

Inspired by the aforementioned works, we propose an unsupervised video object segmentation approach based on saliency detection and mixture models. In our approach, we first obtain saliency maps for the input frames by extracting the spatial static edges within each frame and the motion boundary edges estimated between adjacent frames. Next, potential regions of interest are generated according to a self-adaptive method and the object of interest is located. Then, the information of the obtained object of interest is used as the prior and is incorporated into a Gaussian mixture model with MRF to acquire accurate pixel-wise segmentation results. The contributions of our work can be summarized as follows. First, an unsupervised video object segmentation approach is developed based on saliency detection, Gaussian mixture models and MRF. Second, in order to identify and extract the region of the target object (i.e., the object of interest) in a given video among several candidate regions, a method for identifying the region of the object of interest is proposed. Third, the effectiveness of the proposed unsupervised video object segmentation approach is validated through experimental results on both the SegTrack [36] and SegTrack v2 [21] data sets.

The rest of this work is organized as follows. In Sect. 2, we provide details of our unsupervised video object segmentation approach. In Sect. 3, experiments conducted on the SegTrack and SegTrack v2 data sets are used to evaluate the proposed segmentation approach. Finally, the conclusion is presented in Sect. 4.

Fig. 1 Overview of the proposed unsupervised video object segmentation method

2 The Proposed Approach

2.1 The Framework

The framework of our method can be mainly divided into three steps. First, a spatiotemporal saliency map of the input frame is obtained by extracting the spatial static edges in the same frame and the estimated motion boundary edges between adjacent frames. Second, potential regions of interest are generated according to the self-adaptive method and the object of interest is located. Third, the information of the obtained object of interest is used as the prior and is incorporated into the Gaussian mixture model with MRF to acquire accurate pixel-wise segmentation results. The overview of the proposed unsupervised video object segmentation method is shown in Fig. 1.

2.2 Saliency-Aware Segmentation

The method that we use to calculate the saliency map of the input frame is mainly based on [40]. The framework for obtaining the saliency map is shown in Fig. 2. First, input frames are partitioned into superpixels using the SLIC superpixel method [1]. Then, based on the discontinuity of color and motion, we compute the edge probability and the optical flow to extract two types of edges: spatial static edges within the same frame and motion boundary edges estimated between adjacent frames. After that, we combine the two edge maps into a spatiotemporal edge probability map. Based on this probability map, an intra-frame graph and an inter-frame graph are constructed to calculate the object probability of each superpixel, thereby obtaining the saliency of the current frame.

Fig. 2 The framework for obtaining the saliency map of a given input frame

2.2.1 SLIC Superpixel Method

In our work, we adopt the SLIC method [1] to compute superpixels for each frame. Superpixel methods exploit the similarity between pixel features to group pixels, and use a small number of superpixels instead of a large number of pixels to express the image features. The motivation for using the SLIC superpixel method in our segmentation method is that the superpixels formed by SLIC are more compact and better preserve the original outline of the target object. Moreover, it has fewer parameters and a lower computational cost than many existing superpixel methods. The main steps of the SLIC method are as follows: (1) initialize the seed points (cluster centers); (2) reselect each seed point within its neighborhood: calculate the gradient value of all the pixels in the neighborhood, and move the seed point to the position with the smallest gradient; (3) assign a class label to each pixel in the neighborhood around each seed point; (4) calculate the distance D between each pixel and the seed points, including the color distance and the spatial distance, and take the seed point with the minimum distance as the cluster center of the pixel; (5) iterate until the cluster center of each pixel no longer changes. The distance in step 4 is calculated by

$$\begin{aligned} d_c= & {} \sqrt{(r_j-r_i)^2+(g_j-g_i)^2+(b_j-b_i)^2} \end{aligned}$$
(1)
$$\begin{aligned} d_s= & {} \sqrt{(x_j-x_i)^2+(y_j-y_i)^2} \end{aligned}$$
(2)
$$\begin{aligned} D= & {} \sqrt{\bigg (\frac{d_c}{N_c}\bigg )^2+\bigg (\frac{d_s}{N_s}\bigg )^2} \end{aligned}$$
(3)

where \(d_c\) represents the color distance between the pixel i and the seed point j in terms of RGB color information, \(d_s\) represents the spatial distance, \(N_s\) denotes the maximum spatial distance within the class, and \(N_c\) is the maximum color distance.
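As an illustration, a minimal Python sketch of the combined distance in Eqs. (1)–(3) is given below (this is not our Matlab implementation; the function name, the layout of a pixel as [r, g, b, x, y], and treating \(N_c\) as a fixed compactness constant are assumptions made only for the sketch).

```python
import numpy as np

def slic_distance(pixel, seed, N_s, N_c=10.0):
    """Combined SLIC distance D between a pixel and a seed point (Eqs. 1-3).

    pixel, seed: sequences laid out as [r, g, b, x, y] (assumed layout).
    N_s: maximum spatial distance within a class (the sampling interval).
    N_c: maximum color distance; often replaced in practice by a fixed constant.
    """
    pixel = np.asarray(pixel, dtype=float)
    seed = np.asarray(seed, dtype=float)
    d_c = np.linalg.norm(pixel[:3] - seed[:3])   # color distance, Eq. (1)
    d_s = np.linalg.norm(pixel[3:] - seed[3:])   # spatial distance, Eq. (2)
    return np.sqrt((d_c / N_c) ** 2 + (d_s / N_s) ** 2)   # Eq. (3)
```

In practice, an off-the-shelf SLIC implementation (e.g., skimage.segmentation.slic) can be used to produce the superpixel labels directly.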

2.2.2 Intra-Frame Graph Construction

For the kth frame, an undirected weighted intra-frame graph \(G^k\) is constructed by taking the superpixels within the kth frame as nodes, where the weight between two nodes is denoted as \(W_{mn}^k\). The intra-frame graph is used to construct the foreground probability map for locating the foreground object, and the geodesic distance [40] along the shortest path between two superpixels is used to calculate the objectness of each superpixel. This is mainly based on the assumption that an object region normally has a high spatiotemporal edge value or is surrounded by an area with a high spatiotemporal edge value. For each superpixel \(y_n^k\) in the kth frame, the probability that \(y_n^k\) belongs to the foreground object is calculated by

$$\begin{aligned} p^k(y_n^k) = \mathop {\min }_{t\in T^k} d_{geo}(y_n^k,t,G^k) \end{aligned}$$
(4)

where \(T^k\) indicates the superpixels along the four boundaries of the kth frame. The geodesic distance between any two nodes (i.e. superpixels) \(v_1\) and \(v_2\) in graph \(G^k\) is defined by

$$\begin{aligned} d_{geo}(v_1,v_2,G^k)=\mathop {\min }_{C_{v_1,v_2}}\sum _{mn}W_{mn}^k,\quad mn \in C_{v_1,v_2} \end{aligned}$$
(5)

where \(C_{v_1,v_2}\) denotes a path connecting the nodes \(v_1\) and \(v_2\). The weight \(W_{mn}^k\) is defined by

$$\begin{aligned} W_{mn}^k = \Vert E^k(y_m^k)-E^k(y_n^k)\Vert \end{aligned}$$
(6)

where \(E^k(y_m^k)\) and \(E^k(y_n^k)\) denote the spatiotemporal boundary probability of superpixels \(y_m^k\) and \(y_n^k\), respectively.
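A minimal sketch of this computation is shown below, assuming the superpixel adjacency list and the spatiotemporal boundary probabilities \(E^k\) are already available; SciPy's shortest-path routine stands in for the geodesic-distance computation, and the function and variable names are ours.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def intra_frame_objectness(edge_prob, adjacency, boundary_idx):
    """Foreground probability of each superpixel in frame k (Eqs. 4-6).

    edge_prob:    (N,) spatiotemporal boundary probability E^k of each superpixel
    adjacency:    iterable of (m, n) pairs of spatially adjacent superpixels
    boundary_idx: indices of superpixels along the four frame borders (T^k)
    """
    N = len(edge_prob)
    rows, cols, weights = [], [], []
    for m, n in adjacency:
        # edge weight, Eq. (6); a tiny epsilon keeps zero-weight edges in the sparse graph
        w = abs(edge_prob[m] - edge_prob[n]) + 1e-12
        rows += [m, n]; cols += [n, m]; weights += [w, w]
    graph = csr_matrix((weights, (rows, cols)), shape=(N, N))
    # geodesic (shortest-path) distance from every boundary superpixel, Eq. (5)
    dist = dijkstra(graph, directed=False, indices=np.asarray(boundary_idx))
    return dist.min(axis=0)   # Eq. (4): minimum over all t in T^k
```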

2.2.3 Inter-Frame Graph Construction

We construct an undirected weighted inter-frame graph \(G'^k\) for each pair of consecutive frames (the kth and the \((k+1)\)th frame) by treating all the superpixels in these two frames as nodes. Two kinds of edges are defined: spatially adjacent superpixels are connected by intra-frame edges, whereas temporally adjacent superpixels are linked by inter-frame edges. The edge weights are specified as the Euclidean distance between the average superpixel colors in the CIE-Lab color space.

For the kth frame, a self-adaptive threshold \(\sigma ^k\) for decomposing the kth frame into object-like regions and background regions is calculated through the average of the probability map \(p^k\). Therefore, the object-like regions \(F^k\) and background regions \(B^k\) in the kth frame are defined as:

$$\begin{aligned} F^k = \{y_n^k|p^k(y_n^k)>\sigma ^k\} \cup \{y_n^k|y_n^k \text { is temporally connected to } F^{k-1}\} \end{aligned}$$
(7)

and

$$\begin{aligned} B^k = Y^k-F^k \end{aligned}$$
(8)

Then, in the inter-frame graph, the saliency of the kth frame is calculated by

$$\begin{aligned} S^k(y_n^k) = \mathop {\min }_{b\in B^k\cup B^{k+1}}d_{geo}(y_n^k,b,G'^k) \end{aligned}$$
(9)

Finally, a saliency map is obtained by calculating the saliency of each superpixel.
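The sketch below outlines Eqs. (7)–(9), with the simplification that the temporal propagation of \(F^{k-1}\) in Eq. (7) is omitted; the data layout and function name are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def inter_frame_saliency(p_k, p_k1, lab_means, edges):
    """Saliency of frame k's superpixels (Eqs. 7-9), simplified sketch.

    p_k, p_k1: objectness maps of frames k and k+1 (from the intra-frame graphs)
    lab_means: (N_k + N_k1, 3) mean CIE-Lab colors of all superpixels in both
               frames, with frame k+1 superpixels offset by N_k
    edges:     (m, n) index pairs covering both intra- and inter-frame adjacency
    """
    N_k = len(p_k)
    # self-adaptive thresholds and background sets B^k, B^{k+1} (Eqs. 7-8)
    bg_k = np.where(p_k <= p_k.mean())[0]
    bg_k1 = np.where(p_k1 <= p_k1.mean())[0] + N_k
    background = np.concatenate([bg_k, bg_k1])

    # edge weights: Euclidean distance between average Lab colors
    rows, cols, weights = [], [], []
    for m, n in edges:
        w = np.linalg.norm(lab_means[m] - lab_means[n]) + 1e-12
        rows += [m, n]; cols += [n, m]; weights += [w, w]
    n_total = len(lab_means)
    graph = csr_matrix((weights, (rows, cols)), shape=(n_total, n_total))

    # Eq. (9): geodesic distance to the nearest background superpixel
    dist = dijkstra(graph, directed=False, indices=background)
    return dist.min(axis=0)[:N_k]
```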

2.3 Object Region Extraction

After applying the saliency detection method to calculate the saliency map for the input frame, background and foreground labels are assigned to the frame according to the self-adaptive threshold on the saliency map. However, multiple object regions may appear in the extraction result, because the camera moves to follow the object's movement, and another active object or a dynamic background may also be present in the video. In order to identify and extract the region of the object of interest among several candidate regions, a method for identifying the region of the object of interest is proposed in this section.

Fig. 3 An example of detecting the object of interest in the ith frame using the object of interest region found in the \(i-1\)th frame as the reference

The main assumption of our method for identifying the region of the object of interest in the saliency map is that there is no significant camera movement during the first several frames (normally 5–10 frames) of the video, and thus only the primary active object will be detected. Therefore, we treat the first frame that contains only the object of interest as the reference for the following frames, in order to refine the saliency map so that it contains only the target object. The object detected in the current frame may change in shape and position compared with the one detected in the previous frame, but the displacement of the centroid (i.e., the center position) of the object region between two consecutive frames is not significant due to the short time interval. Based on this idea, when multiple object regions are detected in one frame, we calculate the Euclidean distance from the centroid of each object detected in this frame to the centroid of the object of interest found in the previous frame (i.e., the reference frame), and then consider the object with the minimum distance as the true object of interest. For example, as shown in Fig. 3, if J object regions (\(r_1,\ldots , r_J\)) were found after applying the self-adaptive threshold to the ith frame, then we can use the object of interest region found in the \(i-1\)th frame as the reference to locate the object of interest region in the ith frame. Specifically, in order to locate the correct object of interest among all the regions found in the ith frame, we first compute the centroid \(c_j\) of each object region as

$$\begin{aligned} c_j^{(i)}(x_j,y_j)=\bigg (\frac{\sum _{n=1}^{N_j}x_{nj}}{N_j}, \frac{\sum _{n=1}^{N_j}y_{nj}}{N_j}\bigg ) \end{aligned}$$
(10)

where \((x_j,y_j)\) represents the coordinates of the centroid of the jth object region, \(N_j\) denotes the number of pixels in object region j, and \(x_{nj}\) and \(y_{nj}\) indicate the position of the nth pixel in the jth object region. Next, we calculate the Euclidean distance between the centroid \(c_p\) of the object of interest found in the previous frame and the centroid of each object region obtained in the ith frame as

$$\begin{aligned} d\big (c_j^{(i)},c_p^{(i-1)}\big )=\sqrt{(x_j-x_p)^2+(y_j-y_p)^2} \end{aligned}$$
(11)

where \((x_p,y_p)\) denotes the centroid position of the object of interest region in the reference frame. The object of interest region \(p^{(i)}\) in the ith frame then corresponds to the one with the smallest distance among all J regions

$$\begin{aligned} p^{(i)}= \arg \mathop {\min }_j d\big (c_j^{(i)},c_p^{(i-1)}\big ) \end{aligned}$$
(12)

The object of interest in the \(i+1\)th frame is obtained in the same fashion.
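A compact sketch of this selection step (Eqs. 10–12) is shown below; representing the candidate regions as binary masks and the function name are assumptions for illustration.

```python
import numpy as np

def select_object_of_interest(region_masks, prev_centroid):
    """Pick the region whose centroid is closest to the previous frame's
    object of interest (Eqs. 10-12).

    region_masks:  list of boolean (H, W) masks, one per candidate region r_1..r_J
    prev_centroid: (x_p, y_p) centroid of the object of interest in frame i-1
    """
    best_j, best_dist = None, np.inf
    for j, mask in enumerate(region_masks):
        ys, xs = np.nonzero(mask)                      # pixel coordinates of region j
        centroid = np.array([xs.mean(), ys.mean()])    # Eq. (10)
        dist = np.linalg.norm(centroid - np.asarray(prev_centroid, dtype=float))  # Eq. (11)
        if dist < best_dist:
            best_j, best_dist = j, dist                # Eq. (12): arg min over j
    return best_j
```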

2.4 Segmentation Via Gaussian Mixture Model

In this step, a Gaussian mixture model is adopted to perform object segmentation of the input frames, taking the object of interest detected in the previous step as the prior information. We use Gaussian mixture models to model the feature vectors of the object region and the background region in each frame and divide the pixels into two classes of labels (foreground and background). These feature vectors represent the information of the video (such as pixel values and location coordinates), where labels and pixels are assumed to be independent of each other. In our method, the feature vectors of the foreground and background are assumed to follow multiple Gaussian distributions, which are weighted and linearly combined into a mixture model in order to model the foreground and background. Assume that there are K different regions in the background B and that the vectors in the jth region obey the Gaussian distribution \({\mathcal {N}}(B|\mu _j,\varSigma _j)\) with mean \(\mu _j\) and covariance matrix \(\varSigma _j\). Then, both the background region and the foreground object region of the video can be expressed by mixtures of K Gaussian distributions. Therefore, in order to obtain the probabilities of the object region and the background, it is necessary to infer the parameters of the Gaussian mixture model. Specifically, the Gaussian mixture model representing the background B is given by

$$\begin{aligned} p(B)=\sum _{j=1}^K\pi _j {\mathcal {N}}(B|\mu _j,\varSigma _j) \end{aligned}$$
(13)

where \({\mathcal {N}}(B|\mu _j,\varSigma _j)\) is the Gaussian distribution associated with the jth component of the mixture model. The parameters \(\pi _{j}\) in Eq. (13) are called mixing coefficients which must satisfy the following constraints

$$\begin{aligned} 0\le \pi _{j}\le 1 \quad \text {and} \quad \sum _{j=1}^{K}\pi _{j}=1 \end{aligned}$$
(14)

Then, the log-likelihood function of the Gaussian mixture model is given by

$$\begin{aligned} L(B)=\sum _{n=1}^N\log \sum _{j=1}^K\pi _j{\mathcal {N}}(B_n|\mu _j,\varSigma _j) \end{aligned}$$
(15)

where N denotes the total number of pixels in the background B. The parameters of the Gaussian mixture model are obtained by maximizing the log-likelihood function as

$$\begin{aligned} (\mu ^*,\varSigma ^*, \pi ^*) = \arg \mathop {\max }_{\mu ,\varSigma , \pi } L(B) \end{aligned}$$
(16)

where the optimal parameters \(\mu ^*\), \(\varSigma ^*\), \(\pi ^*\) can be obtained using the expectation maximization (EM) algorithm [27]. Then, based on the prior information obtained from the previous step and the Gaussian mixture models defined for the background and the foreground, the foreground and background probabilities based on color and location information are calculated for each pixel of the original input frames. When the foreground probability of a pixel exceeds its background probability, the pixel is labeled as foreground; otherwise it is labeled as background.
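A simplified sketch of this step is given below. It uses scikit-learn's GaussianMixture (which, like our setting, is initialized with k-means and fitted with EM) as a stand-in for the Matlab implementation; the feature layout and function name are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_pixel_labels(features, fg_mask, n_components=10):
    """Label pixels as foreground/background with two Gaussian mixtures
    (Eqs. 13-16), using the detected object of interest as the prior.

    features: (N, D) per-pixel feature vectors (e.g., color and (x, y) location)
    fg_mask:  (N,) boolean mask of the object-of-interest region from Sect. 2.3
    """
    gmm_fg = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm_bg = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm_fg.fit(features[fg_mask])        # EM maximizes the log-likelihood, Eq. (15)
    gmm_bg.fit(features[~fg_mask])

    # per-pixel log-likelihood under each mixture
    log_p_fg = gmm_fg.score_samples(features)
    log_p_bg = gmm_bg.score_samples(features)
    return (log_p_fg - log_p_bg > 0).astype(np.uint8)   # 1 = foreground
```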

Fig. 4 Sample segmentation results obtained by the Gaussian mixture model for the "monkeydog" video of the SegTrack data set

2.5 Markov Random Field

Although the Gaussian mixture model is an effective approach for segmentation, its segmentation results may contain noise, as shown for instance in Fig. 4. This is due to the fact that image segmentation based on the Gaussian mixture model considers pixels separately and does not take the spatial relationship between nearby pixels into account. In order to tackle this problem, we adopt a Markov random field (MRF) [14] to refine the segmentation results. The MRF exploits the spatial information between pixels, which can distinguish different texture distributions and effectively alleviates the noise problem.

In our case, we apply a pairwise-potential MRF to refine the segmentation results and improve the segmentation accuracy [16]. The frame is represented by an array \(X=(x_1,x_2,\ldots ,x_N)\), where \(x_n\) represents the pixel value at pixel n, and the segmentation result is represented by an array \(Y=(y_1,y_2,\ldots ,y_N)\), where \(y_n\in \{0,1\}\) such that 0 represents the background and 1 denotes the foreground. In the MRF, the image segmentation problem is formulated as the minimization of the MRF-Gibbs energy function.

The unary potential is calculated from the Gaussian mixture models and indicates how well a pixel fits the category (background or foreground) described by each model. The pairwise potential is computed over an isotropic second-order neighborhood system (eight neighbors) and encodes the label consistency of neighboring pixels. The unary potential and pairwise potential are defined by

$$\begin{aligned} E_{unary}(X,Y)= \sum _{n=1}^NU_n(y_n|x_n) \end{aligned}$$
(17)

and

$$\begin{aligned} E_{pairwise}(X,Y)=\sum _{n=1}^N\sum _{j\in \varepsilon _n}V_{nj}(y_n,y_j|x_n,x_j) \end{aligned}$$
(18)

where \(U_n(y_n|x_n)\) is the negative logarithm of the probability of \(y_n\), \(\varepsilon _n\) indicates the neighborhood of pixel n, and \(V_{nj}\) is obtained by calculating the difference between pixel n and its jth neighboring pixel: the more similar the two pixels are, the larger the potential energy.
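For illustration, the sketch below evaluates the MRF-Gibbs energy of a candidate labeling (Eqs. 17–18). The contrast-sensitive form of \(V_{nj}\) (an exponential of the squared color difference) and the weight beta are assumptions, since the text above only specifies that \(V_{nj}\) depends on the difference between neighboring pixels; such an energy is typically minimized with graph cuts or iterated conditional modes.

```python
import numpy as np

def mrf_energy(labels, unary, image, beta=1.0):
    """MRF-Gibbs energy E = E_unary + E_pairwise (Eqs. 17-18) for a binary
    labeling over an 8-connected neighborhood; a simplified sketch.

    labels: (H, W) integer array of 0/1 labels
    unary:  (H, W, 2) negative log-probabilities from the two Gaussian mixtures
    image:  (H, W, C) frame used for the pairwise (contrast) term
    """
    H, W = labels.shape
    # unary term, Eq. (17): pick the cost of the assigned label at each pixel
    energy = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()

    # 8-neighborhood offsets (half of them; symmetry covers the rest)
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        a = labels[max(0, -dy):H - max(0, dy), max(0, -dx):W - max(0, dx)]
        b = labels[max(0, dy):H - max(0, -dy), max(0, dx):W - max(0, -dx)]
        ia = image[max(0, -dy):H - max(0, dy), max(0, -dx):W - max(0, dx)]
        ib = image[max(0, dy):H - max(0, -dy), max(0, dx):W - max(0, -dx)]
        diff = np.sum((ia.astype(float) - ib.astype(float)) ** 2, axis=-1)
        # pairwise term, Eq. (18): penalize label disagreement, more for similar pixels
        energy += beta * np.sum((a != b) * np.exp(-diff))
    return energy
```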

3 Experimental Results

In this work, we propose an unsupervised method that can automatically detect and extract the moving objects in video sequences. The goal of this section is to validate the proposed method by conducting experiments on the SegTrack data set [36] and the SegTrack v2 data set [21]. We also compare our segmentation results with those of several other video segmentation methods to demonstrate the advantages of our approach. All experiments were conducted using Matlab on a Windows PC (Core i7 running at 2.78 GHz, with 32 GB of RAM).

3.1 Experiments on the SegTrack Data Set

We first conducted our experiments on the SegTrack data set, which is a popular video segmentation data set with full pixel-level annotations of multiple objects in each frame of each video. Originally, the SegTrack data set contains six video sequences. Since one of the videos ("penguin") does not have ground-truth annotations, it is not considered in our experiments. Therefore, five video sequences with different characteristics from the SegTrack data set were used in our experiments, including "birdfall" (small object), "cheetah" (object with fast motion patterns), "girl" (object with large shape deformation), "monkeydog" (object with large camera motion) and "parachute" (object with color overlap). We applied a similar setting as in [46], dividing the videos into two categories: videos with a static camera and videos with a dynamic camera. For the video with a static camera (the "birdfall" video), we used background subtraction to extract the moving target region. For the videos with a dynamic camera, we used the saliency segmentation and object of interest extraction methods introduced in Sects. 2.2 and 2.3 to extract the region of the object of interest. After the region of the object of interest was detected, the extracted object region was refined using the Gaussian mixture model and MRF. In our case, the Gaussian mixture model was initialized with the k-means algorithm with the number of centers set to 10.

Fig. 5 The result of the segmentation for the "birdfall" video. The first row illustrates the original input frames, the second row shows the ground-truth, and the third row demonstrates the segmentation results using the proposed method

Fig. 6 The result of the segmentation for the "cheetah" video. The first row illustrates the original input frames, the second row shows the ground-truth, and the third row demonstrates the segmentation results using the proposed method

Fig. 7 The result of the segmentation for the "girl" video. The first row illustrates the original input frames, the second row shows the ground-truth, and the third row demonstrates the segmentation results using the proposed method

Fig. 8 The result of the segmentation for the "monkeydog" video. The first row illustrates the original input frames, the second row shows the ground-truth, and the third row demonstrates the segmentation results using the proposed method

Fig. 9 The result of the segmentation for the "parachute" video. The first row illustrates the original input frames, the second row shows the ground-truth, and the third row demonstrates the segmentation results using the proposed method

Fig. 10 The segmentation results obtained by the proposed method for the SegTrack data set. The region within the red boundary corresponds to the object of interest in the video. (Color figure online)

Fig. 11 The segmentation results obtained by different methods using the SegTrack data set with ground truth

The qualitative results of the proposed unsupervised video segmentation method on the SegTrack data set are shown in Figs. 5, 6, 7, 8 and 9. Figure 10 shows the segmentation results as red boundaries overlaid on the original frames. Based on these results, we can observe that the proposed video segmentation method is able to successfully extract the primary moving object in the given video sequences. Next, a visual comparison of the proposed method with several other video segmentation methods, including [13, 40, 46] and [42], is provided in Fig. 11, where higher saliency probabilities are denoted by brighter pixels. As shown in this figure, the proposed method performs better than the other tested methods, producing more accurate pixel-level saliency estimates both within the objects and along their contours in cluttered backgrounds. Another observation is that the image saliency method [42] obtained the worst performance among all tested methods, as the foreground objects cannot be precisely detected by its saliency maps. This is due to the fact that the method of [42] does not take motion information into account, which degrades its performance in locating objects, especially when the background and foreground have similar colors.

Furthermore, in order to compare quantitatively with other experimental results, we utilized the average per-frame pixel error rate for evaluation, which is the average number of pixels misclassified per frame compared with the ground-truth segmentation [46], and is defined by

$$\begin{aligned} Error = \frac{1}{M}\sum _{f=1}^{M} XOR(m_f, GT_f) \end{aligned}$$
(19)

where \(m_f\) is the segmentation result of the fth frame, \(GT_f\) is the corresponding ground-truth segmentation, and M is the total number of frames in the video. The average per-frame pixel error rate effectively measures how closely the segmentation results approximate the corresponding ground-truth: the smaller the error, the closer the segmentation result is to the ground truth.
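A direct implementation of this metric reads as follows (a sketch; binary mask inputs and the function name are assumed).

```python
import numpy as np

def average_pixel_error(segmentations, ground_truths):
    """Average per-frame pixel error rate (Eq. 19): mean number of pixels
    whose label differs from the ground-truth over all M frames.

    segmentations, ground_truths: equal-length lists of binary (H, W) masks
    """
    M = len(segmentations)
    return sum(np.logical_xor(m, gt).sum()
               for m, gt in zip(segmentations, ground_truths)) / M
```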

Table 1 The average per-frame pixel error rate on the SegTrack data set for different segmentation methods

We compare the proposed method quantitatively with other recent video segmentation methods, including [7] and [30, 37, 40, 46]. Among these methods, [30] and [40, 46] are unsupervised, while [7] and [37] are supervised (i.e., an initial annotation is required for the first frame). In our experiments, we adopted the same settings for the tested segmentation methods as in their original works. The comparison results are shown in Table 1. According to this table, in most cases the proposed method provided better segmentation performance than the tested methods in terms of lower average per-frame pixel error rates. Our method performed slightly worse than [40] on the "girl" sequence; this might be caused by the large shape deformation of the primary object (i.e., the running girl), which degrades the performance of the object region extraction step in our method. However, as we can observe from Table 1, our method obtained better performance than [40] on the other tested video sequences.

3.2 Experiments on the SegTrack v2 Data Set

To further demonstrate the effectiveness of the proposed unsupervised video object segmentation method, more experiments were conducted on the SegTrack v2 data set, which is an updated version of the SegTrack data set. In addition to the five video sequences in the SegTrack data set, eight new sequences are introduced in the SegTrack v2 data set, including "frog", "worm", "soldier", "monkey", "bird of paradise", "drifting car", "hummingbird", and "BMX". Our method was compared with several state-of-the-art video object segmentation methods, including [17] and [24, 29, 31, 38]. We report the experimental results of the different methods on the SegTrack v2 data set in terms of the computational time and another segmentation evaluation metric, namely the intersection over union (IoU), which is defined by

$$\begin{aligned} IoU=\frac{|\text {Segmentation result}\cap \text {Ground-truth}|}{|\text {Segmentation result}\cup \text {Ground-truth}|} \end{aligned}$$
(20)
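For binary masks, the IoU metric can be computed as in the following sketch (function name and mask representation assumed).

```python
import numpy as np

def iou(segmentation, ground_truth):
    """Intersection over union (Eq. 20) for a binary segmentation mask."""
    seg = np.asarray(segmentation, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    union = np.logical_or(seg, gt).sum()
    return np.logical_and(seg, gt).sum() / union if union else 1.0
```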
Table 2 The segmentation performance in terms of the IoU metric and the computational time on the SegTrack v2 data set for different methods

The experimental results of our method and the other tested ones are shown in Table 2 in terms of the IoU metric and the computational time for segmenting one frame. As we can see from this table, the proposed method obtains the highest IoU value among all tested methods. We may also notice that although our method is relatively slower than [17, 29] and [24], it is significantly more computationally efficient than [4] and [31, 38].

4 Conclusion

In this paper, an unsupervised video object segmentation approach was proposed based on a Gaussian mixture model with MRF. In our approach, a saliency detection method was developed to locate the object of interest; it calculates the saliency of each frame in the video sequence and extracts the position and region of the object of interest with more accurate object boundaries. The refined extracted object region was then taken as prior information and incorporated into our Gaussian mixture model and MRF to obtain a precise pixel-wise segmentation result for each frame. The effectiveness of the proposed unsupervised video object segmentation approach was validated by conducting experiments on both the SegTrack and the SegTrack v2 data sets.