1 Introduction

The human visual system (HVS) can quickly select and focus on relevant areas. This selective mechanism is known as the visual attention mechanism, which has a variety of applications in action recognition [1], video summarization [2], video segmentation [3], image captioning [4] and image quality assessment [5]. To simulate the visual attention mechanism, two saliency-related vision tasks, salient object detection [6,7,8,9,10,11,12,13,14] and saliency detection, have been developed in recent years. Unlike salient object detection models, which aim to segment salient objects at the pixel level, saliency detection predicts human fixation regions and can be computed efficiently during video processing [15]. With the rapid development of deep learning and the availability of large image fixation datasets, many image saliency detection models [16,17,18,19,20,21,22] have achieved significant success.

Fig. 1

(a) Video frames selected from UCF Sports [23]. (b) Ground truth. (c) Optical flow. (d) Motion feature maps. (e) Fine-tuned motion feature maps

In contrast to image saliency detection, which acquires saliency cues only from a single image, video saliency detection must additionally consider the differences between multiple consecutive frames to infer the saliency distribution. These differences across the temporal domain are generated by the combined motion of objects and the camera. Human attention is more likely to be attracted to moving objects during free-viewing [24]. As shown in Fig. 1(a) and (b), most of the eye fixations are located around the falling soccer ball. It is therefore necessary to extract motion information from video sequences to acquire saliency cues.

Optical flow reflects the temporal changes and correlations of pixels between adjacent frames and has become a prevailing way to describe motion information [25]. As shown in Fig. 1(c), for the optical flow estimated by RAFT [26], most flow fields outline well-defined salient objects that provide motion saliency cues, but others are blurred because the object moves slowly or only part of it moves. Cong et al. [27] suggest that different motion states of objects can yield different optical flow estimates even in similar scenes. As shown in Fig. 1(d), motion features extracted from clear optical flow are concentrated around the salient object, whereas motion features extracted from blurred optical flow are more diffuse, which makes it difficult to locate salient objects precisely. Obtaining more accurate saliency cues from motion features has therefore become an urgent requirement for video saliency detection.

Existing optical flow-based models [28,29,30,31] for video saliency detection use two-stream networks to extract motion and spatial information separately and then simply fuse them by operations such as concatenation. These direct integration strategies ignore the fact that the two types of information come from different modalities. Lai et al. [31] enhanced spatial information with motion information through an attention mechanism and achieved good performance, but neglected the lack of accurate motion saliency cues in some motion features, so motion and spatial information cannot be aggregated efficiently for saliency detection.

To alleviate the above challenges and compensate for the shortcomings of existing methods, we propose a video saliency detection model that combines spatial and motion information in a multi-scale manner. It consists of a spatial subnet, a motion subnet, a hierarchical fusion subnet and a convGRU subnet. We follow a two-stream structure built from two identical CNN backbones, with the spatial and motion subnets extracting spatial and motion features from video frames and optical flow, respectively. We propose a motion feature fine-tuning module that fine-tunes the motion features with multi-scale spatial features during feature extraction, which focuses the motion features on the salient objects; the fine-tuned motion features are shown in Fig. 1(e). To further study the relationships across motion features at different scales and to extract more semantic information, we design a hierarchical fusion subnet that integrates spatial and motion features in a multi-scale pattern. Considering that the saliency of adjacent frames is correlated, the convGRU [32] subnet generates the final saliency map from the saliency cues of the current frame together with the saliency results of previous frames.

The main contributions of this paper are summarized as follows:

  1.

    We propose a novel layered network, MFHF, for video saliency detection, which contains four subnets: spatial, motion, hierarchical fusion and convGRU subnets. The proposed method extracts informative motion features and fuses them with spatial features to predict video saliency accurately.

  2.

    We develop a motion feature fine-tuning module for extracting new motion features. A series of optical flow maps is used as coarse motion features, which are fine-tuned by incorporating spatial features through cross connections in the last three layers of the spatial subnet.

  3.

    We design a hierarchical fusion subnet that fully combines spatial and motion features at five different scales, retaining more multi-scale contextual information.

The rest of the paper is organized as follows. In Section 2, we review some typical related work. Section 3 elaborates the proposed video saliency detection model. Section 4 reports the experimental results and ablation analysis of our model on four publicly available benchmark datasets. Finally, the conclusions are drawn in Section 5.

2 Related work

In this section, we review related studies on saliency detection for images and videos.

2.1 Salient object detection

Video salient object detection, which aims to segment the most conspicuous objects in a scene, has remained an active topic in the computer vision research community and has a wide range of applications in optimal path planning [33] and robot navigation [34].

Many video salient object detection models [6,7,8,9,10,11,12,13,14] use the motion information provided by optical flow to better segment moving salient objects. Li et al. [12] develop a motion-guided video salient object detection network, which leverages a motion saliency sub-network to attend to and enhance the sub-network for still images. Chen et al. [14] introduce the concept of motion quality and select video frames with high-quality motion as a new training set, which is used to fine-tune the model. Zhang et al. [10] use color contrast and optical flow computation to enhance spatio-temporal correlation, combined with depth confidence optimization, to accomplish stereoscopic video saliency detection.

2.2 Saliency detection

Earlier image saliency detection models typically used a bottom-up framework, also known as a stimulus-driven mechanism [31]. Many of these works are based on computational models of the HVS that comprehensively consider color, orientation and gray-scale features [35,36,37,38,39,40]. In recent years, deep learning models have achieved groundbreaking progress compared with traditional models. These deep saliency models mainly profit from extensive labeled training data and more expressive network structures. Vig et al. [16] used a convolutional neural network to obtain feature vectors and fed them into an SVM [41] to generate image saliency predictions. SALICON [18] aimed to narrow the semantic gap and predicted saliency based on a pre-trained VGG-16 [42]. Similarly, DeepNet [19] connected more network layers to the VGG-16 and obtained more informative multi-scale features to improve performance. SAM [43] iteratively enhanced coarse features to focus on the most salient region through a convolutional LSTM (convLSTM) [44] and a center-prior attention mechanism. DVA [45] utilized a skip-layer network to acquire hierarchical saliency information and achieved efficient image saliency detection.

For video saliency detection, the spatio-temporal information contained in the video frames is critical. Similar to image saliency detection, traditional video saliency detection methods capture saliency cues from hand-crafted spatio-temporal features [46, 47], but low-level hand-crafted features cannot deliver satisfactory performance for modeling dynamic saliency. Recently, many deep learning models have been proposed that adopt different ways of acquiring temporal information. ACLNet [48] proposed a supervised attentive module to encode static attention and then used it in a convLSTM [44] to learn a dynamic saliency representation. SALDPC [49] captured motion information through multi-scale temporal recurrence and can better guide saliency-aware video coding. Compared with temporal modules such as the convLSTM used in the above methods, optical flow has better motion sensing ability [50], which is closely related to the acquisition of motion saliency cues.

Optical flow represents per-pixel motion between two consecutive frames [51], which is sufficient to establish the link between motion and saliency [28] and has become a prevalent way to describe the motion of objects in video saliency detection. Bak et al. [28] developed a two-stream network to extract spatial and motion information from video frames and optical flow, respectively, and integrated them with max fusion or convolutional fusion for video saliency prediction. DeepVS [29] is another well-known video saliency model, which extracted spatial and motion features via YOLO [52] and FlowNet [53]; the two kinds of features were then concatenated to generate spatio-temporal features, and a convLSTM was used to learn inter-frame correlation. However, these methods only integrate motion features with spatial features using direct fusion strategies, ignoring the motion feature diffusion caused by blurred optical flow and making insufficient use of both features.

To address the above problems, we develop a two-stream structure and apply multi-layer spatial features to fine-tune the motion features, so that the fine-tuned motion features tend to focus on salient regions. Furthermore, we combine spatial and motion saliency cues by fusing spatial features and fine-tuned motion features in a multi-scale manner.

Fig. 2

(a) The overall structure of our MFHF. (b) An illustration of the convGRU configuration. (c) Motion feature fine-tuning. (d) Feature fusion structure for layer four in the hierarchical fusion subnet

3 Our approach

3.1 Architecture overview

Research on human visual attention shows that moving objects tend to be more attractive and more readily noticed [54, 55]. For video, the spatial features extracted from each frame and the motion information between consecutive frames are both essential for saliency detection [31]. This inspired us to design MFHF with a two-stream [28] structure for extracting spatial and motion features, as shown in Fig. 2(a). MFHF predicts the video saliency map with four subnets: the spatial, motion, hierarchical fusion and convGRU subnets. In the spatial subnet, we take the current frame \({{{\textbf {I}}}_{t}}\) as input and extract five-level spatial features \(\{{\textbf {SF}}_{t}^{k}\}_{k=1}^{5}\), one from each block k of the backbone. In the motion subnet, the optical flow obtained by RAFT [26] is used as the coarse motion input, and multi-scale motion features \(\{{\textbf {MF}}_{t}^{k}\}_{k=1}^{5}\) are output by incorporating dense residual cross connections with \(\{{\textbf {SF}}_{t}^{k}\}_{k=1}^{5}\). In the hierarchical fusion subnet, we combine spatial and motion features in a multi-scale form to generate fused features \(\{{\textbf {HF}}_{t}^{k}\}_{k=1}^{5}\), and we refine the intermediate saliency maps of the last three levels with the fused features to guide feature extraction. Finally, to learn the inter-frame correlation of a video, we utilize the convGRU [32] subnet to optimize the final saliency prediction.

3.2 Spatial and motion subnets

3.2.1 Spatial subnet

Considering that visual information may be continuously lost during convolution and inter-layer transmission, we aim to effectively extract spatial information at different scales. We build our spatial subnet on ResNet-50 [56], a residual network with fast forward and backward propagation. Specifically, we retain the feature extraction layers of ResNet-50 and remove the fully connected layer to keep high-level spatial information. At time step t, the spatial subnet takes the current frame \({{{\textbf {I}}}_{t}}\) as input and produces spatial features \(\{{\textbf {SF}}_{t}^{k}\}_{k=1}^{5}\) at different scales through five convolution blocks.
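
The spatial subnet can be sketched as a truncated ResNet-50 that exposes one feature map per stage. The snippet below, written with tf.keras, illustrates this; the specific layer taps, the ImageNet initialization and the \(224\times 224\) input size are illustrative assumptions rather than details taken verbatim from the paper.

```python
# A minimal sketch of the spatial subnet: ResNet-50 without the fully connected
# head, exposing one feature map per residual stage (SF^1..SF^5).
import tensorflow as tf
from tensorflow.keras.applications import ResNet50

def build_spatial_subnet(input_shape=(224, 224, 3)):
    backbone = ResNet50(include_top=False,       # drop the fully connected layer
                        weights="imagenet",      # ImageNet pre-training is an assumption
                        input_shape=input_shape)
    # One output per stage of the backbone, giving five spatial scales.
    tap_names = ["conv1_relu", "conv2_block3_out", "conv3_block4_out",
                 "conv4_block6_out", "conv5_block3_out"]
    taps = [backbone.get_layer(n).output for n in tap_names]
    return tf.keras.Model(inputs=backbone.input, outputs=taps, name="spatial_subnet")

spatial_subnet = build_spatial_subnet()
frame = tf.random.uniform((1, 224, 224, 3))      # a single video frame I_t
sf = spatial_subnet(frame)                       # [SF^1, ..., SF^5]
print([f.shape for f in sf])                     # resolutions from 112x112 down to 7x7
```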

3.2.2 Motion subnet

To combine multi-scale spatial and motion features, we also use ResNet-50 to extract coarse motion features. Consecutive optical flows \(\{{{{{\textbf {O}}}}_{t-4}},{{{{\textbf {O}}}}_{t-3}},...,{{{{\textbf {O}}}}_{t}}\}\) computed from the neighboring frames of \({{{\textbf {I}}}_{t}}\) are fed as inputs to obtain the corresponding motion features \(\{{\textbf {MF}}_{t}^{k}\}_{k=1}^{5}\). Although optical flow is the most common way to extract motion features, it still produces some unsatisfactory results, as shown in Fig. 1(c). To obtain more accurate and robust high-level motion representations, we inject the spatial features of the last three layers \(\{{\textbf {SF}}_{t}^{k}\}_{k=3}^{5}\) through cross connections to fine-tune the higher-level motion features \(\{{\textbf {MF}}_{t}^{k}\}_{k=3}^{5}\).

The motion fine-tuning process in the fourth layer of the motion subnet is shown in Fig. 2(c). To reduce the loss of spatial information between convolutional layers and to exploit multi-scale characteristics effectively, we use a nonlinear mapping P to join the third- and fourth-layer spatial features \({\textbf {SF}}_{t}^{3}\) and \({\textbf {SF}}_{t}^{4}\). The joint spatial feature and \({\textbf {MF}}_{t}^{4}\) are then multiplied with a Hadamard product, and the result is used to correct the motion feature \({\textbf {MF}}_{t}^{4}\). In general, the feature fine-tuning process can be formulated as:

$$\begin{aligned} {\textbf {MF}}_{t}^{k}={\textbf {MF}}_{t}^{k}+{\textbf {MF}}_{t}^{k}\odot P(\{{\textbf {SF}}_{t}^{i}\}_{i=3}^{k}),3\le k\le 5 \end{aligned}$$
(1)

where ‘\(\odot \)’ denotes the Hadamard product and P is a nonlinear mapping that connects the spatial features from different stages of the subnet. In Fig. 1, (d) shows the motion features without the fine-tuning process and (e) shows the fine-tuned motion features. Compared with (d), the fine-tuned motion features focus more on areas close to the saliency ground truth; the contribution of this fine-tuning process to the objective metrics is further discussed in the ablation study.
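
The fine-tuning step of (1) can be sketched as follows. The concrete form of the nonlinear mapping P is not spelled out above, so the sketch assumes a \(1\times 1\) convolution with a sigmoid over the resized, concatenated spatial features; treat it as an illustration rather than the exact implementation.

```python
# A minimal sketch of Eq. (1): MF^k <- MF^k + MF^k ⊙ P(SF^3..SF^k), for 3 <= k <= 5.
import tensorflow as tf

def fine_tune_motion(mf_k, sf_list, k):
    """mf_k: motion feature MF_t^k of shape (B, H, W, C).
    sf_list: spatial features [SF_t^3, ..., SF_t^k] from the spatial subnet."""
    h, w, c = mf_k.shape[1], mf_k.shape[2], mf_k.shape[3]
    # Resize each spatial feature to the resolution of MF_t^k and concatenate them.
    joint = tf.concat([tf.image.resize(sf, (h, w)) for sf in sf_list], axis=-1)
    # P: an assumed nonlinear mapping to the channel width of MF_t^k.
    p = tf.keras.layers.Conv2D(c, 1, activation="sigmoid", name=f"P_{k}")(joint)
    # Residual correction with a Hadamard product, as in Eq. (1).
    return mf_k + mf_k * p
```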

3.3 Hierarchical fusion subnet

To make full use of the important role of motion features in video saliency detection, we design a hierarchical fusion subnet that integrates the multi-scale spatial and motion features using dense connections. In detail, we combine the spatial and motion features of each layer, \({\textbf {SF}}_{t}^{k}\) and \({\textbf {MF}}_{t}^{k}\), to generate the fused feature \({\textbf {HF}}_{t}^{k}\). As an example, the architecture of the fourth layer is shown in Fig. 2(d). The upsampled lower-level fusion feature \({\textbf {HF}}_{t}^{3}\) is concatenated with the corresponding-level features \({\textbf {SF}}_{t}^{4}\) and \({\textbf {MF}}_{t}^{4}\). In general, the hierarchical fusion operation is expressed as follows:

$$\begin{aligned} {\textbf {HF}}_{t}^{k}=\left\{ \begin{array}{l} h([{\textbf {SF}}_{t}^{k},{\textbf {MF}}_{t}^{k}]),k=1 \\ h([{\textbf {SF}}_{t}^{k}+{\textbf {MF}}_{t}^{k},Up({\textbf {HF}}_{t}^{k-1})]),1<k\le 5 \end{array}\right. \end{aligned}$$
(2)

where h denotes the feature fusion operator, Up denotes the bilinear interpolation upsampling operator, and \(\left[ \cdot \right] \) denotes channel-wise concatenation. \({\textbf {SF}}_{t}^{k}\) and \({\textbf {MF}}_{t}^{k}\) denote the spatial and fine-tuned motion features from the k-th convolutional layer of the spatial and motion subnets, respectively. Through this fusion operation, each stage of the fusion subnet is supervised by the fusion features of the previous stage together with the spatial and motion features of the corresponding scale, which continuously updates and refines the saliency detection results.

To further supervise the different stages of the fusion subnet, we upsample \({\textbf {HF}}_{t}^{k}\) using bilinear interpolation in the last four layers and apply a convolution layer with a \(3\times 3\times 1\) kernel to obtain the corresponding output \({\textbf {S}}_{t}^{k}\):

$$\begin{aligned} {\textbf {S}}_{t}^{k}=Conv(Up({\textbf {HF}}_{t}^{k})), 2\le k\le 5 \end{aligned}$$
(3)

where Up denotes bilinear interpolation, which upsamples the fused features to a size of \(224\times 224\) with strides of 8, 16, 4 and 4, respectively. As the per-frame output of the hierarchical fusion subnet, \({\textbf {S}}_{t}^{k}\) is fed into the convGRU subnet as input to learn the correlation between frames.
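
One possible reading of the fusion step (2) and the readout (3) is sketched below; the fusion operator h is assumed to be a single \(3\times 3\) convolution, and the channel width and output activation are illustrative choices rather than details from the paper.

```python
# A minimal sketch of the hierarchical fusion (Eq. 2) and the per-level readout (Eq. 3).
import tensorflow as tf

def fuse(sf, mf, hf_prev=None, channels=64):
    h = tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu")
    if hf_prev is None:                                    # k = 1: h([SF, MF])
        x = tf.concat([sf, mf], axis=-1)
    else:                                                  # 1 < k <= 5: h([SF+MF, Up(HF^{k-1})])
        up = tf.image.resize(hf_prev, (sf.shape[1], sf.shape[2]))
        x = tf.concat([sf + mf, up], axis=-1)
    return h(x)

def readout(hf_k, out_size=(224, 224)):
    # Eq. (3): resize to 224x224, then a 3x3 convolution with one output channel.
    up = tf.image.resize(hf_k, out_size)
    return tf.keras.layers.Conv2D(1, 3, padding="same", activation="sigmoid")(up)

# Usage: apply fuse() over the five scales, passing the previous HF as hf_prev,
# then apply readout() to HF^2..HF^5 to obtain the multi-scale saliency maps S^2..S^5.
```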

3.4 ConvGRU subnet

Compared with images, videos have strong inter-frame correlations that cannot be ignored. The convGRU subnet, which has fewer parameters and higher computational efficiency than convLSTM [44], is utilized to model the inter-frame correlations and improve the dynamic saliency of videos. We first concatenate the four-scale saliency maps \(\{{\textbf {S}}_{t}^{k}\}_{k=2}^{5}\) as the input \({{x}^{t}}\) and then feed it into the convGRU unit, whose gating mechanism learns the inter-frame correlation and temporal information:

$$\begin{aligned} \begin{aligned}&z^{t}=\sigma (W_{h}^{z}*h^{t-1}+W_{x}^{z}*x^{t})\\&r^{t}=\sigma (W_{h}^{r}*h^{t-1}+W_{x}^{r}*x^{t})\\&\tilde{h}^{t}=\tanh (W_{h}^{h}*(r^{t}\odot h^{t-1})+W_{x}^{h}*x^{t})\\&h^{t}=z^{t}\odot \tilde{h}^{t}+(1-z^{t})\odot h^{t-1} \end{aligned} \end{aligned}$$
(4)

where r and z denote the reset and update gates, respectively, h represents the hidden state, W represents the learnable weights, and ‘\(*\)’ is the convolution operator. An illustration of the convGRU configuration is shown in Fig. 2(b). Compared with the LSTM [57] structure, the convGRU [32] greatly reduces the number of parameters to fit and saves time and computational cost. The final saliency detection result of our model, \({\textbf {S}}_{t}^{fin}\), is produced by a convolution of the hidden state \(h^{t}\).
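
A compact implementation of the convGRU update in (4) could look as follows; the kernel size and channel width are illustrative, and bias terms are left implicit in the convolutions.

```python
# A minimal convGRU cell implementing the gated update of Eq. (4).
import tensorflow as tf

class ConvGRUCell(tf.keras.layers.Layer):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        conv = lambda name: tf.keras.layers.Conv2D(
            channels, kernel_size, padding="same", name=name)
        self.wz_h, self.wz_x = conv("Wz_h"), conv("Wz_x")
        self.wr_h, self.wr_x = conv("Wr_h"), conv("Wr_x")
        self.wh_h, self.wh_x = conv("Wh_h"), conv("Wh_x")

    def call(self, x_t, h_prev):
        z = tf.sigmoid(self.wz_h(h_prev) + self.wz_x(x_t))         # update gate
        r = tf.sigmoid(self.wr_h(h_prev) + self.wr_x(x_t))         # reset gate
        h_tilde = tf.tanh(self.wh_h(r * h_prev) + self.wh_x(x_t))  # candidate state
        return z * h_tilde + (1.0 - z) * h_prev                    # new hidden state

# Usage sketch: fold the cell over the per-frame inputs x^t (here 4 channels,
# one per concatenated saliency map S^2..S^5).
cell = ConvGRUCell(channels=4)
h = tf.zeros((1, 224, 224, 4))
for x in tf.random.uniform((5, 1, 224, 224, 4)):   # 5 consecutive time steps
    h = cell(x, h)
```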

3.5 Loss function

We propose a loss function suited to saliency prediction with our network structure. At time step t, both the final output \({\textbf {S}}_{t}^{fin}\) and the intermediate outputs \({\textbf {S}}_{t}^{k}\) are used to supervise the network. The overall loss \({{\ell }_{all}}\) is defined as:

$$\begin{aligned} {{\ell }_{all}}=\ell ({\textbf {S}}_{t}^{fin},F,G)+\sum \nolimits _{k=3}^{5}{\ell ({\textbf {S}}_{t}^{k},F,G)} \end{aligned}$$
(5)

where G denotes the ground-truth saliency map and F denotes the binary fixation map.

Similar to SAM [43], we design our loss \({\ell }\) as a linear combination of four loss terms, which monitors the quality of the prediction from several complementary quality factors. Specifically, our loss function can be expressed as:

$$\begin{aligned} \begin{aligned} \ell (S,F,G)=&{{\ell }_{kl}}(S,G)+{{\omega }_{1}}{{\ell }_{cc}}(S,G)\\&+{{\omega }_{2}}{{\ell }_{nss}}(S,F)+{{\omega }_{3}}{{\ell }_{sim}}(S,G) \end{aligned} \end{aligned}$$
(6)

where S denotes the predicted saliency map, and \({{\omega }_{1}}\), \({{\omega }_{2}}\), \({{\omega }_{3}}\) denote the weights of the CC, NSS and SIM terms, which are set to 0.2, 0.1 and 0.1, respectively.

\({{\ell }_{kl}}\) is based on the Kullback-Leibler (KL) divergence metric:

$$\begin{aligned} {{\ell }_{kl}}(S,G)=\sum \nolimits _{i}{{{G}_{i}}\log }(\frac{{{G}_{i}}}{{{S}_{i}}}) \end{aligned}$$
(7)

where i indexes the \({{i}^{th}}\) pixel.

\({{\ell }_{cc}}\) is derived from the Linear Correlation Coefficient (CC) metric and can be calculated as:

$$\begin{aligned} {{\ell }_{cc}}(S,G)=-\frac{{\text {cov}}(S,G)}{\rho (S)\rho (G)} \end{aligned}$$
(8)

where \({\text {cov}}(\cdot )\) denotes the covariance and \(\rho (\cdot )\) denotes the standard deviation.

\({{\ell }_{nss}}\) is derived from the Normalized Scanpath Saliency (NSS) metric, which can be expressed as:

$$\begin{aligned} {{\ell }_{nss}}(S,F)=-\frac{1}{N}\sum \nolimits _{i}{\frac{{{S}_{i}}-\mu (S)}{\rho (S)}}\times {{F}_{i}} \end{aligned}$$
(9)

where \(N=\sum \nolimits _{i}{{{F}_{i}}}\) denotes the total number of fixated pixels, \(\mu (\cdot )\) denotes the mean, and \(\rho (\cdot )\) denotes the standard deviation.

\({{\ell }_{sim}}\) is derived from the Similarity (SIM) metric, which quantifies the similarity between two distributions and can be represented as:

$$\begin{aligned} {{\ell }_{sim}}(S,G)=-\sum \nolimits _{i}{\min ({{S}_{i}},{{G}_{i}})} \end{aligned}$$
(10)

where S and G represent normalized probability distributions.
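
The combined loss of (5)-(10) can be sketched in NumPy as follows; the epsilon terms and the normalization of the maps are implementation assumptions added for numerical stability.

```python
# A minimal NumPy sketch of the loss terms in Eqs. (6)-(10); S, G are saliency maps
# and F is a binary fixation map, all given as 2-D float arrays.
import numpy as np

EPS = 1e-8

def kl_loss(S, G):                      # Eq. (7): KL divergence between the maps
    S, G = S / (S.sum() + EPS), G / (G.sum() + EPS)
    return np.sum(G * np.log(EPS + G / (S + EPS)))

def cc_loss(S, G):                      # Eq. (8): negative linear correlation
    s, g = S - S.mean(), G - G.mean()
    return -np.sum(s * g) / (np.sqrt(np.sum(s ** 2) * np.sum(g ** 2)) + EPS)

def nss_loss(S, F):                     # Eq. (9): negative normalized scanpath saliency
    s_norm = (S - S.mean()) / (S.std() + EPS)
    return -np.sum(s_norm * F) / (F.sum() + EPS)

def sim_loss(S, G):                     # Eq. (10): negative histogram intersection
    S, G = S / (S.sum() + EPS), G / (G.sum() + EPS)
    return -np.sum(np.minimum(S, G))

def combined_loss(S, F, G, w=(0.2, 0.1, 0.1)):   # Eq. (6) with the weights from the text
    return (kl_loss(S, G) + w[0] * cc_loss(S, G)
            + w[1] * nss_loss(S, F) + w[2] * sim_loss(S, G))
```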

4 Experiments

In this section, we describe the datasets, evaluation metrics and implementation details, and report the experimental results.

4.1 Experimental setup

1) Datasets: To evaluate our method, four popular datasets, UCF Sports [23], Hollywood-2 [23], DIEM [58] and DHF1K [48], are used for performance analysis. The statistics of the four datasets are summarized in Table 1.

Table 1 Statistics of four typical video saliency detection datasets we used
Fig. 3

The training and validation loss for each epoch in the training phase

UCF Sports provides fixations collected from 19 observers and is divided into 103 training and 47 testing videos. It consists of a series of sport-related videos with a resolution of 720*480, so a saliency detection model needs strong adaptability to deal with similar scenes.

Table 2 Quantitative comparison with 14 methods on UCF Sports

Hollywood-2 consists of 1707 videos in total, which provides an important reference for a comprehensive evaluation of model performance. It contains 823 training and 884 testing videos. Unlike UCF Sports, Hollywood-2 has diverse background information and a wide range of moving objects, which makes it more challenging and difficult.

DIEM provides fixations collected from 50 observers and is composed of multiple movie clips with very different styles at a high resolution of 1280*720. In DIEM, 64 videos are used for training and 20 videos for testing.

DHF1K is the most complex video saliency dataset. It consists of 1000 diverse videos, of which only 700 have publicly available high-quality annotations and are normally used for training/validating models, while the remaining 300 videos are used for testing with the help of the dataset owner. Unlike the previous two datasets, which collected fixations under a given visual task, both DHF1K and DIEM collect fixations under free-viewing.

2) Evaluation Metrics: We adopt five popular metrics to evaluate the results: Area Under the Curve by Judd (AUC-J) [59], Shuffled AUC (s-AUC) [60], Linear Correlation Coefficient (CC), Normalized Scanpath Saliency (NSS) and Similarity (SIM). For all of these metrics, higher scores indicate better performance.
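
As an illustration of one of these metrics, AUC-J treats the saliency values at fixated pixels as positives and all remaining pixels as negatives and measures the area under the ROC curve. The sketch below follows this definition; the official benchmark code may differ in details such as thresholding and negative sampling.

```python
# An illustrative AUC-J computation; sal_map is a predicted saliency map and
# fixation_map is a binary map of recorded fixations with the same shape.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_judd(sal_map, fixation_map):
    sal = sal_map.ravel().astype(np.float64)
    fix = (fixation_map.ravel() > 0).astype(np.int32)
    if fix.sum() == 0:
        return np.nan              # no fixations: the metric is undefined
    return roc_auc_score(fix, sal)
```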

Table 3 Quantitative comparison with 14 methods on Hollywood-2
Table 4 Quantitative comparison with 14 methods on DHF1K

3) Implementation Details: Our model is implemented with the Keras framework on a single Nvidia Tesla GPU (16 GB memory) and a 2.2 GHz Intel Xeon E5-2630 v4 CPU. During the training phase, we set the video batch size to 1 and the frame batch size to 5. The whole model is trained in an end-to-end manner. The number of training epochs is set to 300. The learning rate is set to 1e-5 and remains unchanged during training. The parameters of the model are learned on the training data with the Adam [61] optimizer.
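
The optimizer settings above translate into a Keras configuration along the following lines; the tiny stand-in model, the dummy data and the KL-divergence stand-in loss are there only to keep the snippet self-contained and are not the actual MFHF network or its combined loss.

```python
# Training configuration sketch: Adam with a fixed learning rate of 1e-5 and a
# frame batch size of 5, matching the settings described in the text.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(1, 3, padding="same", activation="sigmoid",
                           input_shape=(224, 224, 4))
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="kld")                   # stand-in for the combined loss of Section 3.5
x = tf.random.uniform((5, 224, 224, 4))     # frame batch size of 5
y = tf.random.uniform((5, 224, 224, 1))
model.fit(x, y, batch_size=5, epochs=1)     # 300 epochs in the actual training
```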

Figure 3 shows the training loss and validation loss of the proposed model when trained on the UCF Sports dataset. The validation loss of the model is minimized when the number of epochs reaches 260, and we save the model weights at this point for testing.

4.2 Comparison results

We rigorously compare our MFHF with 16 other state-of-the-art methods, including SALICON [18], Two-streams [28], DeepNet [19], ACLNet [48], DVA [45], DeepVS [29], Shallow-Net [19], SalEMA [62], TASED-Net [63], GDLC [64], STRA-NET [31], KSORA [65], DeepCT [66], STA-3D [67], ECA-Net [68] and the video saliency detection model proposed by Chen et al. in [69]. For a fair comparison, we directly adopt the results published in the corresponding papers.

4.2.1 Performance on UCF Sports

We trained MFHF on the UCF Sports training videos and evaluated it on the test set in the standard way. Table 2 shows the performance of all models. Compared with other video saliency detection methods, our MFHF has advantages on the AUC-J, SIM and NSS metrics. This may benefit from the fine-tuned motion features, which are more efficient and reliable for describing moving regions. The motion feature visualization results in Fig. 1(d) and (e) also show that the fine-tuned motion features are more focused and provide clearer saliency cues than the original motion features.

4.2.2 Performance on Hollywood-2

All 823 videos in the training set are used to train our model, and Table 3 shows the performance of our MFHF and 15 other models on Hollywood-2. The proposed method achieves the best results on the AUC-J, SIM and NSS metrics and the second-best results on the s-AUC and CC metrics. This may be due to our hierarchical fusion subnet, which retains more multi-scale information and adapts better to complex scenarios. At the same time, our multi-level loss function also plays a vital role in the performance improvement, as it effectively supervises the extraction and fusion of features.

Table 5 Quantitative comparison with 3 variants of TASED-Net on the validation set of DHF1K

4.2.3 Performance on DHF1K

As suggested in [48], we split the video sequences into 600/100/300 for training/validation/testing. Quantitative results on the test set of DHF1K are shown in Table 4. From Table 4, we can see that TASED-Net achieves the best performance on four metrics, although it does not perform well on the first two datasets. The proposed method gives better results than the remaining methods.

To further understand the efficiency of our model, we analyze the influence of the frame batch size on detection performance. TASED-Net obtains the highest s-AUC value on DHF1K by using 32 past frames to predict the saliency of the next frame, whereas our model uses only 6 frames. Table 5 compares our model with three variants of TASED-Net on DHF1K, using the results for 4, 8 and 32 input frames given in [63]. The comparison indicates that performance varies with the number of input frames. Although fewer frames are involved, our model achieves the best performance on two metrics, showing that it can capture spatio-temporal saliency cues with fewer frames.

Table 6 Quantitative comparison with 14 methods on DIEM

4.2.4 Performance on DIEM

DIEM has 84 high-resolution videos with plentiful everyday scenes, of which 64 are normally used for training and 20 for testing. In [31], the authors compared the performance of STRA-Net on DIEM when trained on different datasets and demonstrated that more training samples improve performance, and that training with Hollywood-2 gives comparable results. To evaluate generalization ability, all 20 testing videos are used as test samples following [31]. For simplicity, we test our network trained only on Hollywood-2, and the results are shown in Table 6. Note that we list only the best results of STRA-Net on DIEM, which uses plenty of training data, while we use only part of the training data. The results in Table 6 clearly illustrate that our approach achieves competitive performance on most metrics and has generalization capability comparable to other advanced models.

Table 7 Ablation study on UCF Sports
Table 8 Quantitative comparison of model variants with different settings on UCF Sports
Table 9 Quantitative comparison of the saliency maps generated from different levels of the hierarchical fusion subnet on UCF Sports
Fig. 4

Visualization of saliency detection. (a) Frame. (b) Optical flow. (c) Ground truth. (d) HF-Output1. (e) HF-Output2. (f) HF-Output3. (g) Ours(w/o convGRU). (h) Ours(w. convGRU)

4.3 Ablation study

As shown in the previous subsection, our MFHF performs well on multiple datasets, which may be due to the motion fine-tuning and hierarchical fusion modules. To verify the effectiveness of each component, we design several MFHF variants and test them on UCF Sports.

We first explore the effectiveness of the proposed motion feature fine-tuning module and hierarchical fusion module. The first row of Table 7 is the two-stream baseline, the fifth row is our full method, and rows 2 to 4 contain different combinations of our modules. The effectiveness of the motion feature fine-tuning module and the hierarchical fusion module can be verified from the comparison of rows 1-3. The result in row 4 shows that the two modules are also effective when combined. The last two rows of Table 7 also demonstrate the effectiveness of the convGRU. Combining all of them achieves the best performance.

To further verify the effectiveness of the structure of the proposed model, we introduce modules from other methods for comparative experiments. We used the spatial enhancement module of STRA-Net [31] to replace the motion feature fine-tuning module; the comparison results are shown in the first and fourth rows of Table 7. Compared with the spatial enhancement module of STRA-Net, our motion fine-tuning module makes more effective use of motion information to model video saliency. To verify the effectiveness of hierarchical fusion, we used fusion mechanisms such as convolutional fusion [28] and max fusion [28] in our framework, which only fuse single-layer features. The comparison of the last three rows shows that the hierarchical fusion mechanism achieves better performance by using multi-scale contextual features (Table 8).

In addition, we compared the saliency maps generated from the fusion features HF-Output1, HF-Output2 and HF-Output3, which come from the last three layers of the hierarchical fusion subnet. These saliency maps correspond to \(\text {S}_{t}^{3}\), \(\text {S}_{t}^{4}\) and \(\text {S}_{t}^{5}\) in Section 3.3, respectively. The overall contribution of the multi-scale fusion features to video saliency is assessed by the results shown in Table 9 and the visualizations in Fig. 4. They show that the multi-level features are helpful for modeling video saliency, and that the deeper hierarchical fused features are better at distinguishing the salient parts of the videos.

5 Conclusions

In this paper, we propose a novel spatial and motion dual-stream framework for video saliency detection. To obtain saliency-related features, the dual-stream network extracts multi-scale spatial features, which are then used to fine-tune the motion features in a dense residual cross-connection architecture. With the help of higher-level semantic spatial features, the fine-tuned motion features capture more saliency-related information. We then fuse the multi-scale features with the hierarchical fusion subnet to retain more contextual saliency information, and integrate the multi-scale saliency maps of the same frame into a loss function to supervise the saliency detection process. A convGRU subnet is used to capture the relationships between frames. Extensive results on four video saliency benchmark datasets demonstrate the superiority of the proposed model in precisely predicting dynamic human fixations, and the ablation experiments show the necessity and effectiveness of each component. Motion features play an important role in video processing; if the motion feature fine-tuning were implemented within a transformer framework, the results could be further improved.