1 Introduction

Salient object detection aims to extract the attractive objects in images and videos [5]. It can support many robotics tasks, such as object detection [12, 38], action recognition [11, 26], and scene analysis [10, 50]. It can also serve as a foundation for various multimedia applications, including image/video classification [4, 29, 48], summarization [28, 34], quality assessment [27, 47], retrieval [8, 16, 41, 52], content-aware editing [32, 33, 39], and social media analysis [3, 42]. Compared with the maturity of image-based salient object detection [5], efficiently detecting salient objects in videos still faces many challenges.

A straightforward idea for video-based salient object detection is to apply an existing image-based method independently on each video frame and generate saliency maps frame by frame. However, such a solution suffers from several problems. First, image-based salient object detection methods only consider the spatial difference among image regions, such as color contrast, but ignore their temporal difference, which provides the object motion cue that is important in video-based salient object detection. Second, serious incoherence exists among the generated saliency maps when each frame is processed independently, because the appearances of objects and background may vary considerably across frames. Finally, obvious content redundancy exists in videos, since successive frames must contain sufficiently similar content to provide a smooth viewing experience; simply ignoring this redundancy leads to unnecessary computational cost. Figure 1 shows an example of the difference between image-based and video-based salient object detection, in which Fig. 1b is the result of applying a typical image-based salient object detection method independently on each video frame and Fig. 1c is the result of our method. We can see that video-based salient object detection is superior to image-based salient object detection in emphasizing salient objects and suppressing background.

Fig. 1

An example of visual comparison between image-based and video-based salient object detection. a Video frames. b Result of image-based salient object detection by applying [7] independently on each video frame. c Result of video-based salient object detection by using our method

To tackle these problems, we propose a novel video-based salient object detection method, which explores the potential of both spatio-temporal difference and spatio-temporal coherence of video content. Figure 2 shows an overview of the proposed method. Based on the super-pixel representation of video frames, we first calculate the saliency values of super-pixels on the keyframes from spatial difference, i.e., color contrast among adjacent super-pixels in the same keyframe, and temporal difference, i.e., object motion extracted from adjacent frames. Second, we construct the relationships between super-pixels in the same frame or in adjacent frames based on their color similarity and motion vectors. Next, we propagate saliency values from the keyframes to non-keyframes and improve saliency coherence on all the frames under the constraint of spatio-temporal coherence, i.e., making similar super-pixels in the same frame and in adjacent frames have coherent saliency values. Finally, we globally normalize all the saliency maps and obtain the salient object detection result of the given video. Some preliminary results of our method were presented in [17]. In this paper, we improve the saliency propagation mechanism of our method, which yields better salient object detection performance. Moreover, we supplement the analysis of the key parameters in our method and provide a more comprehensive evaluation on two public datasets.

Fig. 2

An overview of our method. A given video is first represented with super-pixels and their relationships. Next, the saliency maps on keyframes are generated based on spatio-temporal difference. After this, the saliency values are propagated from the keyframes to non-keyframes with the constraint of spatio-temporal coherence, and the saliency coherence within each salient object and background on all the frames is improved during saliency propagation. Finally, the salient object detection result of the given video is generated after global normalization

Our major contributions include the following. First, we present a novel video-based salient object detection method, which can effectively detect both static and moving salient objects based on spatio-temporal difference. Second, we make use of video content redundancy to improve the efficiency of our method by combining salient object detection on keyframes with saliency propagation among frames.

2 Related work

2.1 Saliency cues

Most existing salient object detection methods focus on dealing with various images and videos without the constraints of specific applications. Hence, they use general saliency cues, including color, depth, location and motion, rather than specific cues, such as vehicle detection in surveillance video analysis. The color cue, especially color contrast, is widely used in both image-based and video-based salient object detection. Color contrast can be measured globally or locally based on different features, such as average color and color histograms. Specifically, Achanta et al. [1] calculated the saliency value of each pixel according to the difference between its color and the average color of the whole image. Cheng et al. [7] decomposed a given image into regions and measured the saliency value of each region with global color contrast weighted by spatial distance.

Depth is an effective cue for salient object detection on RGB-D images and videos [9]. Depth contrast can distinguish salient objects from their surrounding background even when they have similar colors [35], and it performs well as the only feature in salient object detection [20]. Moreover, the depth prior, i.e., assigning higher saliency values to near regions, is also effective in detecting salient objects when combined with other features [21, 36].

Content location is utilized as a supplement in saliency detection, especially on natural images [40]. Both center bias [7] and boundary bias [51] can effectively improve the performance of salient object detection, because photographers tend to place salient objects near the image center, and objects near the image center attract more attention.

Motion is an important cue in salient object detection on videos, because moving objects easily attract viewers' attention [49]. To remove the influence of camera motion, motion contrast is used to represent object motion based on block-level motion vector fields [13, 18] or optical flow [30, 44].

Recently, features extracted by deep neural networks have shown outstanding performance in many applications [25, 45], and they have also been introduced into salient object detection. For example, Li et al. [22] proposed an end-to-end deep contrast network, which consists of a pixel-level fully convolutional stream and a segment-wise spatial pooling stream, to generate pixel-level saliency maps. Hou et al. [15] presented a top-down method that brings short connections into the skip-layer structures of the Holistically-Nested Edge Detector architecture, which achieves accurate saliency detection results based on a reasonable combination of low-level and high-level features.

2.2 Saliency coherence

A prominent problem in salient object detection is that it is difficult to keep saliency values coherent within each salient object. This hampers downstream applications of salient object detection, such as image/video editing.

Graph-based models, such as random walk [19] and manifold ranking [46], are used to address this problem by propagating saliency values. For example, Qin et al. [37] utilized cellular automata for saliency map optimization, which is robust to input saliency maps generated by different methods. Liu et al. [30] used both forward and backward saliency propagation based on inter-frame similarity matrices to improve temporal coherence.

Another solution for improving saliency coherence is to treat saliency maps as the input of segmentation algorithms and generate binary saliency maps. Such a technique is named saliency cuts. Saliency cuts methods may use only saliency maps, or saliency maps together with the original images, as their input. For instance, Achanta et al. calculated a threshold from the saliency value distribution of the input saliency map and binarized the saliency map according to this threshold [1]. Li et al. produced segmentation seeds using adaptive triple thresholding and fed the seeds to the GrabCut algorithm [24].

3 Our method

3.1 Video representation

Given a video, we first over-segment each frame \(F^{t}\) into super-pixels using the simple linear iterative clustering (SLIC) algorithm [2] in order to reduce computational cost while retaining the intrinsic structure of video content. Based on the over-segmentation, we represent \(F^{t}\) with a set of super-pixels \(F^{t}= \left \{{p^{t}_{1}},{p^{t}_{2}},\ldots ,p^{t}_{N_{t}}\right \}\), where \({p^{t}_{i}}\) is a super-pixel on \(F^{t}\) and \(N_{t}\) is the number of super-pixels on \(F^{t}\). Inspired by [7], we extract a 123-bin color histogram \(\boldsymbol {h}^{t}_{i}\) in the L*a*b* color space from each super-pixel \({p^{t}_{i}}\), and retain the top-85 dominant bins of each histogram to improve efficiency.
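A minimal sketch of this super-pixel representation, assuming scikit-image is available: each frame is over-segmented with SLIC and a normalized L*a*b* color histogram is computed per super-pixel. The segment count and the 5 × 5 × 5 = 125-bin layout (an approximation of the 123-bin histogram) are illustrative assumptions, not our exact settings.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.segmentation import slic

def superpixel_histograms(frame_rgb, n_segments=300, bins_per_channel=5):
    """Over-segment one frame with SLIC and return (labels, per-super-pixel L*a*b* histograms)."""
    labels = slic(frame_rgb, n_segments=n_segments, compactness=10, start_label=0)
    lab = rgb2lab(frame_rgb)
    hists = []
    for sp in range(labels.max() + 1):
        pix = lab[labels == sp]                      # L*a*b* pixels of this super-pixel
        h, _ = np.histogramdd(pix, bins=bins_per_channel,
                              range=[(0, 100), (-128, 128), (-128, 128)])
        h = h.ravel()
        hists.append(h / max(h.sum(), 1e-12))        # L1-normalize each histogram
    return labels, np.stack(hists)
```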

Based on the super-pixel representation of video content, we construct the relationships between super-pixels according to their similarities. We first define the relationship between two super-pixels on the same frame. Supposing that \({p^{t}_{i}}\) and \({p^{t}_{j}}\) are on \(F^{t}\), we define their relationship based on their color similarity weighted by spatial distance:

$$ r^{t}_{i,j} = \left( 1 - ||\boldsymbol{h}^{t}_{i} - \boldsymbol{h}^{t}_{j}||_{2}\right)\cdot\exp\Big(-{d^{t}_{i,j}}^{2}\Big), $$
(1)

where \(\boldsymbol {h}^{t}_{i}\) and \(\boldsymbol {h}^{t}_{j}\) are the color histograms of \({p^{t}_{i}}\) and \({p^{t}_{j}}\), respectively; \(||\cdot||_{2}\) denotes the Euclidean distance; \(d^{t}_{i,j}\) is the spatial distance between the centers of \({p^{t}_{i}}\) and \({p^{t}_{j}}\), which is divided by the length of the image diagonal for normalization. To avoid the cumulative effect of small \(r^{t}_{i,j}\) values, we filter them with a threshold τ, i.e., \(r^{t}_{i,j}\) is set to 0 if it is smaller than τ. In our experiments, τ is set to 0.3.
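The intra-frame relationship in (1) can be sketched as follows; the function assumes the histograms are already normalized and that the super-pixel centers and image diagonal are given in pixels.

```python
import numpy as np

def intra_frame_relationship(h_i, h_j, c_i, c_j, diag, tau=0.3):
    """h_*: normalized color histograms; c_*: (x, y) super-pixel centers; diag: image diagonal."""
    color_sim = 1.0 - np.linalg.norm(h_i - h_j)                    # 1 - ||h_i - h_j||_2
    d = np.linalg.norm(np.asarray(c_i) - np.asarray(c_j)) / diag   # normalized center distance
    r = color_sim * np.exp(-d ** 2)
    return r if r >= tau else 0.0                                  # filter weak relationships
```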

We also define the relationship between two super-pixels on adjacent frames. Supposing that \({p^{t}_{i}}\) and \(p^{t+1}_{j}\) are on the adjacent frames \(F^{t}\) and \(F^{t+1}\), respectively, we seek a matching super-pixel \(p^{t+1}_{i^{\prime }}\) for \({p^{t}_{i}}\) on \(F^{t+1}\) using an object tracking strategy:

$$ p^{t+1}_{i^{\prime}} = \underset{p^{t+1}_{k}\in \Omega^{t+1}_{i}}{\arg\min}||\boldsymbol{h}^{t}_{i} - \boldsymbol{h}^{t+1}_{k}||_{2}, $$
(2)

where \(\boldsymbol {h}^{t}_{i}\) and \(\boldsymbol {h}^{t+1}_{k}\) are the color histograms of \({p^{t}_{i}}\) and \(p^{t+1}_{k}\), respectively; \(\Omega ^{t+1}_{i}\) is a region on \(F^{t+1}\) centered at the same coordinate as the center of \({p^{t}_{i}}\); \(p^{t+1}_{k}\) is a super-pixel whose center is located within \(\Omega ^{t+1}_{i}\). In our experiments, the size of \(\Omega ^{t+1}_{i}\) is set to 64 × 64 pixels. Assisted by \(p^{t+1}_{i^{\prime }}\), we define the relationship between \({p^{t}_{i}}\) and \(p^{t+1}_{j}\) as follows:

$$ r^{t,t+1}_{i,j} = \left( 1 - ||\boldsymbol{h}^{t}_{i} - \boldsymbol{h}^{t+1}_{j}||_{2}\right)\cdot\exp\Big(-{d^{t+1}_{i^{\prime},j}}^{2}\Big), $$
(3)

where \(\boldsymbol {h}^{t}_{i}\) and \(\boldsymbol {h}^{t+1}_{j}\) are the color histograms of \({p^{t}_{i}}\) and \(p^{t+1}_{j}\), respectively; \(d^{t+1}_{i^{\prime },j}\) is the normalized distance between the centers of \(p^{t+1}_{i^{\prime }}\) and \(p^{t+1}_{j}\). Similarly, we filter out the \(r^{t,t+1}_{i,j}\) values that are smaller than the threshold τ, which equals 0.3 in our experiments.
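A sketch of the inter-frame relationship in (2) and (3), assuming the histograms and centers of the super-pixels on \(F^{t+1}\) are stored in parallel lists; the candidate search and data layout are illustrative.

```python
import numpy as np

def inter_frame_relationship(h_i, c_i, hists_next, centers_next, j,
                             diag, window=64, tau=0.3):
    """h_i, c_i: histogram and center of p_i on frame t; hists_next, centers_next:
    histograms and centers of the super-pixels on frame t+1; j: index of p_j."""
    # Eq. (2): matching super-pixel = most similar histogram inside a
    # window x window region centered at p_i's coordinate.
    cand = [k for k, c in enumerate(centers_next)
            if abs(c[0] - c_i[0]) <= window / 2 and abs(c[1] - c_i[1]) <= window / 2]
    if not cand:
        return 0.0
    i_match = min(cand, key=lambda k: np.linalg.norm(h_i - hists_next[k]))
    # Eq. (3): color similarity to p_j weighted by distance from the matched super-pixel.
    color_sim = 1.0 - np.linalg.norm(h_i - hists_next[j])
    d = np.linalg.norm(np.asarray(centers_next[i_match]) - np.asarray(centers_next[j])) / diag
    r = color_sim * np.exp(-d ** 2)
    return r if r >= tau else 0.0
```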

3.2 Salient object detection on keyframes

As reported in [30], the dominant time cost in video-based salient object detection is caused by optical flow estimation, which plays a significant role in saliency calculation on videos. To achieve high efficiency, we exploit the content redundancy of video frames by calculating the saliency values of super-pixels directly on several keyframes and generating saliency maps for the other frames with saliency propagation. In this way, we can skip optical flow estimation on a major proportion of video frames and improve the efficiency of our method. Though there are many methods for video keyframe extraction, we simply use uniform sampling for efficiency, i.e., sampling a keyframe every k video frames. In our experiments, k is set to 4.
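Uniform keyframe sampling then reduces to indexing every k-th frame, as in this small helper:

```python
def select_keyframes(num_frames, k=4):
    """Uniformly sample a keyframe every k frames (k = 4 by default)."""
    return list(range(0, num_frames, k))
```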

On each keyframe, we detect a saliency map based on spatio-temporal difference. For saliency calculation based on spatial difference, we calculate the saliency value of each super-pixel \({p^{t}_{i}}\) using color contrast with boundary connectivity [14] as follows:

$$ {c^{t}_{i}} = \sum\limits^{N_{t}}_{k=1}{r^{t}_{i,k}\cdot \left( 1-\exp\left( -{{B^{2}_{k}}}/{2}\right)\right)}, $$
(4)

where \({c^{t}_{i}}\) is the saliency value of \({p^{t}_{i}}\) calculated based on color contrast; \(B_{k}\) is the boundary connectivity strength of \({p^{t}_{k}}\), which denotes the ratio of the length of \({p^{t}_{k}}\)'s edge on the image boundary to the length of its whole edge; \(N_{t}\) is the number of super-pixels on frame \(F^{t}\).
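A compact sketch of (4), assuming the intra-frame relationship matrix from (1) and the boundary connectivity strengths are precomputed:

```python
import numpy as np

def color_contrast_saliency(R_intra, B):
    """R_intra: (N, N) intra-frame relationships from (1); B: (N,) boundary connectivity."""
    boundary_term = 1.0 - np.exp(-(B ** 2) / 2.0)    # 1 - exp(-B_k^2 / 2)
    return R_intra @ boundary_term                   # c_i = sum_k r_{i,k} * boundary_term_k
```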

For saliency calculation based on temporal difference, we estimate optical flow using the large displacement optical flow algorithm [6] to obtain pixel-level motion vectors, and further calculate the saliency value of each super-pixel \({p^{t}_{i}}\) as follows:

$$ {o^{t}_{i}} = 1 - \left|\left|\boldsymbol{m}^{t}_{i} - \boldsymbol{m}^{t}_{g}\right|\right|_{2}, $$
(5)

where \({o^{t}_{i}}\) is the saliency value of \({p^{t}_{i}}\) calculated based on object motion; \(\boldsymbol {m}^{t}_{i}\) is the normalized motion vector of \({p^{t}_{i}}\), quantized into eight uniform orientation intervals over [−π,π]; \(\boldsymbol {m}^{t}_{g}\) represents the global motion on \(F^{t}\), which is calculated as follows:

$$ \boldsymbol{m}^{t}_{g} = \frac{1}{N_{t}}\sum\limits^{N_{t}}_{k=1}{\boldsymbol{m}^{t}_{k}\cdot{\left( 1-\exp\left( -{{B^{2}_{k}}}/{2}\right)\right)}}, $$
(6)

where \(B_{k}\) and \(N_{t}\) are the same as in (4).
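The temporal saliency in (5) and (6) can be sketched as below. How the eight-bin motion descriptor is built from the optical flow inside each super-pixel is our illustrative assumption about the normalized motion vector.

```python
import numpy as np

def motion_descriptor(flow, labels, sp, n_bins=8):
    """8-bin orientation histogram of the optical flow inside super-pixel sp."""
    fx, fy = flow[labels == sp, 0], flow[labels == sp, 1]
    ang = np.arctan2(fy, fx)                                   # orientations in [-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1e-12)

def motion_saliency(M, B):
    """M: (N, 8) per-super-pixel motion descriptors; B: (N,) boundary connectivity."""
    w = 1.0 - np.exp(-(B ** 2) / 2.0)
    m_global = (M * w[:, None]).sum(axis=0) / len(M)           # global motion, Eq. (6)
    return 1.0 - np.linalg.norm(M - m_global, axis=1)          # Eq. (5)
```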

We linearly combine the saliency values of each super-pixel calculated based on color contrast and object motion:

$$ {s^{t}_{i}} = \alpha\cdot {c^{t}_{i}} + (1-\alpha)\cdot {o^{t}_{i}}, $$
(7)

where α is a weighting parameter that balances color contrast and object motion; a smaller α emphasizes object motion. It is set to 0.3 in our experiments.
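In code, the combination in (7) is simply:

```python
def combine_saliency(c, o, alpha=0.3):
    """Linear combination in (7); the smaller alpha is, the more object motion dominates."""
    return alpha * c + (1.0 - alpha) * o
```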

3.3 Saliency propagation

Once the saliency maps are generated on all the keyframes, we propagate saliency values from the keyframes to their adjacent frames and further to the other frames. To avoid an unconstrained increase of saliency values during propagation, we keep the sum of the saliency values propagated from each super-pixel constant; otherwise, all super-pixels would obtain high saliency values after enough rounds of propagation. Hence, we define the propagation weight between two super-pixels \({p^{t}_{i}}\) and \(p^{*}_{j}\) on adjacent frames as follows:

$$ \omega^{t,*}_{i,j} = \frac{r^{t,*}_{i,j}}{1+{\sum}^{N_{t-1}}_{m=1}{r^{t,t-1}_{i,m}}+{\sum}^{N_{t+1}}_{n=1}{r^{t,t+1}_{i,n}}}, $$
(8)

where ∗ can be t − 1 or t + 1; \(r^{t,t-1}_{i,m}\) and \(r^{t,t+1}_{i,n}\) are the relationship scores between \({p^{t}_{i}}\) and a super-pixel on \(F^{t-1}\) and \(F^{t+1}\), respectively; the "1" in the denominator of \(\omega ^{t,*}_{i,j}\) denotes the ratio of the saliency value retained by \({p^{t}_{i}}\) during propagation.

We also propagate saliency values among the super-pixels on the same frame to improve saliency coherence within each salient object. The propagation weight between two super-pixels \({p^{t}_{i}}\) and \({p^{t}_{j}}\) is defined as follows:

$$ \omega^{t}_{i,j} = \frac{r^{t}_{i,j}}{{\sum}^{N_{t}}_{k=1}{r^{t}_{i,k}}}, $$
(9)

where the ratio of the retained saliency for \({p^{t}_{i}}\) in propagation is set to 1, i.e., \(r^{t}_{i,i}\) is equal to 1.
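The weights in (8) and (9) can be sketched as follows, assuming the intra- and inter-frame relationship matrices of frame \(F^{t}\) are precomputed as dense arrays:

```python
import numpy as np

def propagation_weights(R_intra, R_prev, R_next):
    """R_intra: (N_t, N_t) intra-frame relationships from (1);
    R_prev: (N_t, N_{t-1}) and R_next: (N_t, N_{t+1}) inter-frame relationships from (3)."""
    R_intra = R_intra.copy()
    np.fill_diagonal(R_intra, 1.0)                              # r_{i,i} = 1, as stated above
    denom = 1.0 + R_prev.sum(axis=1) + R_next.sum(axis=1)       # denominator of (8)
    W_prev = R_prev / denom[:, None]                            # weights toward frame t-1
    W_next = R_next / denom[:, None]                            # weights toward frame t+1
    W_intra = R_intra / R_intra.sum(axis=1, keepdims=True)      # Eq. (9)
    return W_intra, W_prev, W_next
```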

According to (8) and (9), we calculate the saliency value of each super-pixel \({p^{t}_{i}}\) after one round of saliency propagation among the super-pixels on the adjacent frames and the same frame as follows:

$$ {s^{t}_{i}} = \mathbf{w}^{t}_{i}\mathbf{s}^{t}, $$
(10)

where \(\mathbf {w}^{t}_{i} = [\omega ^{t-1,t}_{1,i},\ldots ,\omega ^{t-1,t}_{N_{t-1},i},\omega ^{t}_{i,1},\ldots ,\omega ^{t}_{i,i},\ldots ,\omega ^{t}_{i,N_{t}},\omega ^{t+1,t}_{1,i},\ldots ,\omega ^{t+1,t}_{N_{t+1},i}]\) and \(\mathbf {s}^{t} = \left [s^{t-1}_{1},\ldots ,s^{t-1}_{N_{t-1}},{s^{t}_{1}},\ldots ,{s^{t}_{i}},\ldots ,s^{t}_{N_{t}},s^{t+1}_{1},\ldots ,s^{t+1}_{N_{t+1}}\right ]^{T}\). Specifically, \(\omega ^{t}_{i,i}\) is equal to \(\frac {1}{1+{\sum }^{N_{t-1}}_{m=1}{r^{t,t-1}_{i,m}}+{\sum }^{N_{t+1}}_{n=1}{r^{t,t+1}_{i,n}}}+\frac {1}{{\sum }^{N_{t}}_{k=1}{r^{t}_{i,k}}}\).

We iteratively propagate saliency values among the super-pixels on all the video frames until reaching a predefined number of iterations or until the saliency values of all super-pixels become stable. Because the initial saliency values of all super-pixels on non-keyframes are set to zero and the sum of saliency values is kept constant during propagation, the saliency values of the super-pixels within salient objects are not high enough after propagation. Meanwhile, some background super-pixels on keyframes may be mistakenly assigned high saliency values in salient object detection on keyframes, which leaves residual saliency on background super-pixels after propagation. In order to enhance the saliency difference between salient objects and background, we globally normalize all the saliency maps and obtain the final salient object detection result for the given video.
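A simplified sketch of this propagation stage: all super-pixels of the video are stacked into one vector, a global row-wise weight matrix W is assembled from (8) and (9) (its construction is omitted here), and the update in (10) is iterated until convergence, followed by global normalization. This is an illustrative simplification, not our exact implementation.

```python
import numpy as np

def propagate_saliency(W, s_init, n_iters=20, tol=1e-4):
    """W: (N, N) global propagation weight matrix; s_init: initial saliency,
    zero on all non-keyframe super-pixels."""
    s = s_init.copy()
    for _ in range(n_iters):
        s_new = W @ s                        # one propagation round, cf. (10)
        if np.abs(s_new - s).max() < tol:    # stop when saliency values stabilize
            s = s_new
            break
        s = s_new
    # Global normalization over all frames to enhance object/background contrast.
    return (s - s.min()) / max(s.max() - s.min(), 1e-12)
```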

4 Experiments

4.1 Datasets and experimental settings

We validated the performance of our method on two public datasets, namely SegTrackV2 [23] and UVSD [30], which contain 14 and 18 videos with manually labeled pixel-level ground truth of salient objects, respectively. Specifically, the videos in SegTrackV2 cover various motion activities and scenes, while the videos in UVSD contain complicated motion and complex scenes. The diversity of these datasets increases the difficulty of salient object detection. We compared the proposed method with the state-of-the-art methods of video-based salient object detection: DCMR [18], GD [44], SGSP [30], SP [31] and SR [43].

All the experiments were conducted on a computer with an Intel i5 2.8 GHz CPU and 8 GB memory. For all the compared methods, we used the default settings suggested by their authors and normalized their generated saliency maps for a fair comparison.

4.2 Performance analysis

Several key parameters influence the performance of the proposed method, including the threshold for relationship filtering in video representation, the interval of keyframe selection, and the weight combining color contrast based saliency and object motion based saliency. We analyze the influence of these parameters as follows.

Threshold for relationship filtering

The filtering of small relationship values in (1) and (3) aims to avoid propagating the saliency of a super-pixel to unrelated super-pixels, which would weaken the effect of saliency map initialization. However, an overly strict threshold reduces the effect and robustness of saliency propagation, which may lead to failure in generating high quality saliency maps, especially for the non-keyframes. We validate the performance of our method using filtering thresholds from 0.1 to 0.6 with a step of 0.1. Figure 3 shows the validation results. Low filtering thresholds, such as 0.1 and 0.2, cause a slight decline in performance, while a high filtering threshold, such as 0.6, prevents saliency propagation among super-pixels. Hence, we choose 0.3 as the default filtering threshold in our experiments.

Fig. 3

Performance of our method using different filtering thresholds. a Results on SegTrackV2. b Results on UVSD

Interval in keyframe selection

Determining a suitable interval for keyframe selection is a trade-off between effectiveness and efficiency. A large interval leads to a small number of keyframes and a low time cost of salient object detection on these keyframes, but brings the risk of generating low quality saliency maps on non-keyframes. In contrast, a small interval increases the number of keyframes, whose saliency maps usually have high quality thanks to direct salient object detection, but decreases the efficiency. We validate the performance of the proposed method with keyframe selection intervals of 2, 4, 8, and 16. Figure 4 shows the validation results and Table 1 shows the corresponding time cost. A small interval (such as 2) slightly improves performance but noticeably increases time cost, while a large interval (such as 16) causes serious performance degradation. We choose 4 as the default keyframe selection interval as a trade-off between effectiveness and efficiency.

Fig. 4

Performance of our method using different keyframe selection intervals. a Results on SegTrackV2. b Results on UVSD

Table 1 Running time per frame of our method using different keyframe selection intervals

Combination weight of color contrast based saliency and object motion based saliency

We combine the saliency values based on color contrast and object motion in (7) to generate saliency maps on keyframes. Object motion usually plays a more important role than color contrast in video-based salient object detection, because moving objects easily attract viewers' attention. We validate the performance of our method using combination weights from 0.1 to 0.5 with a step of 0.1. Figure 5 shows the validation results; a smaller combination weight leads to better performance. However, both datasets, SegTrackV2 and UVSD, are biased in the spatio-temporal characteristics of their video content, i.e., they emphasize objects with obvious and complex motion. In order to handle different types of videos, such as nearly static videos, we choose 0.3 as the default combination weight in our experiments.

Fig. 5

Performance of our method using different combination weights. a Results on SegTrackV2. b Results on UVSD

4.3 Experimental results

We compare our method with five state-of-the-art video-based salient object detection methods. The quantitative and qualitative results are shown in Figs. 6 and 7, respectively. From Fig. 6, we can see that our method performs similarly to SGSP and outperforms the other methods. An important reason for this good performance is the exploration of both spatial and temporal characteristics of video content in salient object detection on keyframes and in bidirectional saliency propagation. In comparison, the methods that do not use optical flow, such as DCMR and SR, perform obviously worse than the other methods because they lack relatively accurate temporal characteristics. Meanwhile, the effective saliency propagation strategy helps our method outperform the methods that use similar spatio-temporal characteristics of video content, such as GD and SP.

Fig. 6

Comparison of different methods with PR curves. a Results on SegTrackV2. b Results on UVSD

Fig. 7

Qualitative comparison of salient object detection results using different methods. a Video frames. b Ground truth. cg Results of DCMR [18], GD [44], SGSP [30], SP [31] and SR [43]. h Our results

In Fig. 7, the top three videos are from SegTrackV2 and the bottom three are from UVSD. We can see that our method effectively detects nearly complete salient objects under complicated object motion and complex scenes. In comparison, the other methods fail either in emphasizing salient objects, such as the results of SR on the videos Diving and Waterski, or in suppressing residual saliency in background, such as the results of SGSP on Monkey and Jogging.

Table 2 shows the running time per frame of the different methods. The methods without optical flow, such as DCMR and SR, require less time than the other methods, but they perform obviously worse (as shown in Fig. 6). Compared to the methods with similar effectiveness, such as SGSP and GD, our method is more efficient because it only needs to detect salient objects on keyframes. Therefore, our method outperforms the existing video-based salient object detection methods when taking both effectiveness and efficiency into account.

Table 2 Comparison of running time per frame of different methods

4.4 Discussion

We also observed some limitations of our method in the experiments. Figure 8 shows a failure example, in which our method fails to generate high quality saliency maps on non-keyframes because overly complicated object motion prevents effective saliency propagation. In this situation, our method needs a smaller keyframe selection interval to improve performance.

Fig. 8

A failure example of our method. a Video frames. b Our results, consisting of low quality saliency maps

5 Conclusion

In this paper, we presented a video-based salient object detection method that fully explores the potential of spatio-temporal difference and coherence in video content. Specifically, we detected salient objects on keyframes based on the combination of color contrast and object motion, and further propagated saliency values within and across frames. Finally, we generated saliency maps with high saliency coherence for all the frames of a given video. The experimental results showed that our proposed method achieves better salient object detection results with higher efficiency compared to the state-of-the-art methods.

In the future, we will focus on exploring more spatio-temporal characteristics of video content for salient object detection and on improving the efficiency of our method by adaptive keyframe selection.