1 Introduction

Video saliency detection aims to identify regions of interest in dynamic scenes by simulating the attention mechanism of the human eye. Compared with detecting salient objects in images, video salient object detection faces new challenges posed by continuous motion cues and abundant inter-frame information. Differential features of consecutive frames can be exploited to achieve better background suppression and edge detection. Video saliency detection has been widely applied in video compression [1], video target tracking [2], video quality assessment [3], video summarization [4], scene understanding [5,6,7,8,9], image understanding [26,27,28,29,30,31,32], action/object recognition [33, 34], etc.

Prevailing video saliency detection methods generally take a macroscopic perspective and use optical flow and LSTM-based convolutional models to extract temporal characteristics. However, these methods often overlook inter-frame details when extracting and integrating features, which results in poor continuity of differential features between consecutive frames.

To address this problem, this paper proposes a static feature extraction network in which several static saliency subnetworks work in parallel to extract frame-level features and obtain initial temporal information. Inter-frame differential information between consecutive frames is used to build a co-attention mechanism: the co-attention edge module and central module integrate inter-frame relationships, and the temporal relationships between the differential information of consecutive frames help the network learn the temporal and spatial continuity of inter-frame motion. The detection results are clearly improved by combining the co-attention module with the pixel gradient-based refinement module. We also apply recurrent multi-module training optimization and a loss function self-adjustment strategy to further refine the network model.

The main contributions of the paper are as follows:

  (1) A fully convolutional neural network integrating temporal difference and pixel gradient is designed, which effectively addresses poor temporal-spatial coherence and edge continuity.

  (2) A co-attention mechanism is built to integrate temporal differentials, addressing the inability of existing network models to learn information continuity well.

  (3) A pixel gradient-based method for optimizing image details is proposed to rectify the details of salient objects and improve the edge continuity and central-area details of targets.

  (4) A self-adjustment strategy for the loss functions is used to improve the stability of the network structure and avoid overfitting.

  (5) Our model has been tested on three public datasets, and the experimental results prove its effectiveness.

2 Related work

2.1 Image saliency object detection

Image salient object detection can be realized by two kinds of network models based on different attention mechanisms: top-down and bottom-up models. The top-down approach processes images under the guidance of human consciousness. Yang et al. [10] developed a saliency model combining conditional random fields and a visual dictionary, which learns the conditional random field through feature self-adjustment; with the CRF serving as the output layer, the dictionary can be learned under structural supervision. Gao et al. [11] proposed a bottom-up saliency detector based on the statistics of natural image features. Cheng et al. [12] proposed a salient object extraction algorithm based on regional contrast, which computes both the global contrast difference and spatial weighting consistency. In [35], Hu et al. proposed a novel co-saliency detection framework that transforms each image into a set of semantic superpixel clusters through segmentation and generates images of different sizes for each input image using the VGG-16 model. The model also explores a new graph convolutional network method to fine-tune multi-scale features, capturing the information common to all scales as well as the private or complementary information of each scale. Zhang et al. [37] developed a new deep network framework, DeepACG, which uses the Gromov-Wasserstein distance to establish dense 4D correlation volumes for all image pixel pairs. This model also designs a semantic-aware common attention module (SCAM), which recognizes the semantic classes of foreground co-salient objects and modulates this information into deep representations to localize relevant pixels, and it further utilizes a contrast edge enhancement module to capture richer context and preserve fine-grained spatial information.

2.2 Video saliency object detection

Compared with salient object detection in images, video saliency detection is more challenging because it involves more complicated detection situations and temporal information. In response to this problem, Yu et al. [13] proposed a video saliency model incorporating spatiotemporal cues and uncertainty weighting, and Wang et al. [14] proposed a spatiotemporal saliency detection method for video sequences based on the gradient flow field and energy optimization. Besides the problem of exploiting spatiotemporal information, the lack of large annotated video datasets has always been an issue. Wang et al. [15] proposed a data augmentation technique that synthesizes video training data from available annotated image data so that the network can learn diverse saliency information without overfitting to the limited training videos. In [36], Wang et al. proposed a key salient object re-enhancement method based on bottom-up weighted features and top-down semantic knowledge to better simulate the human visual attention mechanism in dynamic scenes. This method effectively eliminates noise and redundancy and provides accurate local spatiotemporal features for saliency detection, further improving detection accuracy.

Unlike existing image or video saliency detection methods, this paper presents a new method built on an image detection framework. The method combines the temporal differences of video frames with an image saliency detection framework and can capture and exploit pixel gradient changes to improve the overall visual quality of salient objects.

3 Proposed method

In this paper, we propose a new method to improve the edge continuity and central details of video saliency detection. Our approach is composed of three parts: the joint static feature extraction network (SFN), the temporal difference co-attention module (TCM), and the pixel gradient-based refinement processing module (GRM). The network framework is shown in Fig. 1. SFN extracts temporal features on the basis of an image saliency detection method. After SFN extracts the features, TCM integrates temporal difference information via a co-attention mechanism, enhances spatiotemporal consistency between consecutive video frames, and draws attention to both edge and central areas. GRM then improves the details in the edge and central areas by capturing the pixel gradients of the original images. In addition, a self-adjustment strategy is adopted for the loss functions to avoid overfitting during training. Specifically, the raw data first enters the SFN network for feature extraction and is then fed to the TCM module for edge and center feature enhancement. Next, the GRM module refines the detail information of the features, after which the features pass through the TCM module for a second round of enhancement and the GRM module for a second round of refinement. Residual connections are added to supplement the information lost during convolution-based feature enhancement, ensuring that modules do not mistakenly learn unimportant information due to information loss and preventing gradient explosion and overfitting. Finally, a sigmoid function binarizes the feature information, and the loss is computed on the results. The following sections explain each designed module in detail; a minimal sketch of the overall data flow is given below.
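The sketch below (PyTorch) illustrates this data flow under the assumption that `SFN`, `TCM`, and `GRM` are implemented as described in Sections 3.1-3.3; the class name `SaliencyPipeline` and the way the residual connections are wired are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SaliencyPipeline(nn.Module):
    """Hypothetical wrapper showing the SFN -> TCM -> GRM -> TCM -> GRM flow."""

    def __init__(self, sfn: nn.Module, tcm: nn.Module, grm: nn.Module):
        super().__init__()
        self.sfn = sfn   # joint static feature extraction network (Sect. 3.1)
        self.tcm = tcm   # temporal difference co-attention module (Sect. 3.2)
        self.grm = grm   # pixel gradient-based refinement module (Sect. 3.3)

    def forward(self, prev_frame, mid_frame, next_frame):
        # Static features of the former, middle, and latter frames
        f_prev, f_mid, f_next = self.sfn(prev_frame, mid_frame, next_frame)

        # First round: co-attention enhancement, then gradient refinement,
        # with a residual connection to compensate for information lost in convolution
        x = self.tcm(f_prev, f_mid, f_next)
        x = self.grm(x) + x

        # Second round of enhancement and refinement (recurrent optimization)
        x = self.tcm(f_prev, x, f_next)
        x = self.grm(x) + x

        # Binarize the fused features into a saliency prediction
        return torch.sigmoid(x)
```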

Fig. 1
figure 1

Diagram of the designed network framework. SFN is a joint encoder-decoder network composed of three image saliency detection modules, whose initial weights are obtained by training on DUTS-TR. TCM and GRM recursively extract and process the temporal differences of the edge and central areas of the high-level semantic information output by SFN via skip connections

3.1 Joint static feature extraction network

In view of the excellent feature extraction ability of VGG16, this paper uses a U-Net with a VGG16 encoder as the subnet of the joint static feature extraction network to extract initial features from video frames. Because a sufficiently large annotated dataset for video saliency training is lacking, the U-Net model is trained on DUTS-TR, and the trained weights are copied to the subnets that extract features from the former, middle, and latter frames. This ensures maximal relevance of the output information at the start of the overall dynamic network training (Fig. 2). A sketch of this initialization is given below.
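A minimal sketch of this weight-sharing initialization, assuming `make_subnet` is a factory that builds the VGG16-based U-Net subnet and `pretrained_path` points to the weights obtained on DUTS-TR (both names are hypothetical):

```python
import torch
import torch.nn as nn

def build_joint_sfn(make_subnet, pretrained_path: str) -> nn.ModuleList:
    """Build three parallel U-Net subnets initialized with the same DUTS-TR weights."""
    subnets = nn.ModuleList([make_subnet() for _ in range(3)])   # U1, U2, U3
    state = torch.load(pretrained_path, map_location="cpu")
    for net in subnets:
        net.load_state_dict(state)   # copy the pretrained weights into every subnet
    return subnets
```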

Fig. 2
figure 2

Joint static feature extraction network

We design unilateral information communication channels for the three static extraction networks (U-Net) and use Formulas (1) and (2) to integrate high-level semantic information at the bottom of the U-Net so that the middle frame can acquire more common features.

$$\textrm{f}={\left(\textrm{conv}5\right)}_{\textrm{U}1}\times {\left(\textrm{conv}5\right)}_{\textrm{U}3}$$
(1)
$${\left(\textrm{conv}5\right)}_{\textrm{U}2}={\textrm{conv}}_{1\times 1}\left(\textrm{cat}\left(\textrm{f},{\left(\textrm{conv}5\right)}_{\textrm{U}2}\right)\right)$$
(2)

where f represents the fused features of the 5th layer; conv5 represents the features of the 5th layer; U1, U2, and U3 represent the first, second, and third subnets, respectively; cat(*) represents channel-wise concatenation of features; and conv1×1(*) represents dimensionality reduction via a 1×1 convolution. A sketch of this fusion is given below.
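A small sketch of Formulas (1) and (2), assuming the three subnets output 5th-layer feature maps of identical shape (B, C, H, W); `Conv5Fusion` is a hypothetical name for this fusion step:

```python
import torch
import torch.nn as nn

class Conv5Fusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution reduces the concatenated 2C channels back to C
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, conv5_u1, conv5_u2, conv5_u3):
        f = conv5_u1 * conv5_u3                  # Formula (1): element-wise product
        fused = torch.cat([f, conv5_u2], dim=1)  # cat(*): channel-wise concatenation
        return self.reduce(fused)                # Formula (2): 1x1 conv reduction
```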

3.2 Temporal differential collaborative attention module

Compared with image saliency detection, video saliency detection has one advantage: cues from sequential motion. To improve detection performance, we design a temporal differential co-attention edge module and a central module targeting the edge and central zones of objects, respectively.

As shown in Fig. 3(a), the co-attention edge module for former frames integrates the saliency features output by SFN by coordinating inter-frame relationships and then assigns weights spatially. This module is composed of two parts: the former frame and the rest. The former frame is trained with annotated GT maps, and the rest is trained with the union of GT maps to obtain as much edge information as possible. Multi-scale pooling and multi-scale convolution are applied to the former frame to integrate its information. For the other part, multi-scale pooling and upsampling are used for dimension reduction, and the integrated information is added to the former frame. The overlapped information is further processed and reduced to one channel via 1×1 convolution blocks, and is then multiplied by I1_(t-1) to weight the initial data. Similar to Fig. 3(a), Fig. 3(b) combines the temporal differences between the middle frame and the former and latter frames and adds them to the middle frame. After dimension reduction, the middle-frame information is updated so that more attention can be paid to its edge zones. The TCM central module follows a similar principle, except that the intersection of the GT maps of the other parts is used rather than their union, and convolution and pooling are conducted before upsampling. A simplified sketch of the co-attention operation follows.
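The sketch below illustrates the co-attention weighting described above: multi-scale pooled features of a neighbouring frame are fused with the current frame, reduced to one channel by a 1×1 convolution, and multiplied back onto the frame features. The use of adaptive pooling to 3×3, 6×6, and 9×9 grids and the single-neighbour interface are assumptions made for illustration, not the exact module in Fig. 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, channels: int, pool_sizes=(3, 6, 9)):
        super().__init__()
        self.pool_sizes = pool_sizes
        # 1x1 convolution reduces the stacked branches to a one-channel weight map
        self.reduce = nn.Conv2d(channels * (len(pool_sizes) + 1), 1, kernel_size=1)

    def forward(self, frame_feat, neighbour_feat):
        h, w = frame_feat.shape[-2:]
        branches = [frame_feat]
        for k in self.pool_sizes:
            pooled = F.adaptive_avg_pool2d(neighbour_feat, k)   # multi-scale pooling
            branches.append(F.interpolate(pooled, size=(h, w),
                                          mode="bilinear", align_corners=False))
        attn = torch.sigmoid(self.reduce(torch.cat(branches, dim=1)))
        return frame_feat * attn   # spatially re-weight the initial frame features
```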

Fig. 3
figure 3

Temporal difference co-attention edge module. I1_(t-1) represents the former-frame information output by SFN, I1_(t) represents the middle-frame information output by SFN, I1_(t+1) represents the latter-frame information output by SFN, I2_(t) represents the updated middle-frame information, I3_(t) represents the middle-frame information updated by the gradient transform module, Avgp3,6,9 represents the 3×3, 6×6, and 9×9 multi-scale pooling operations, Conv3,6,9 represents the 3×3, 6×6, and 9×9 multi-scale convolution operations, Res represents a residual block, conv1×1 represents a 1×1 convolution block, ⨂ represents element-wise multiplication, and ⨁ represents channel-wise concatenation

3.3 Pixel gradient-based refinement processing module

In original images, pixel values are similar inside salient objects but differ sharply from the background along their edges, so the pixel gradient can be used to set up a barrier separating the inside from the outside. Once the profiles of salient objects are obtained, the pixel gradient becomes a powerful tool for capturing and exploiting the details of salient objects. In this paper, the TCM modules integrate the location information of edge zones and central zones to support the gradient-based optimization module. As shown in Fig. 4, based on the information provided by the TCM edge module, we define the edges of objects by shrinkage followed by expansion, which frees edge optimization from interference by pixel information outside the edge. Meanwhile, as the size of salient objects varies, convolution blocks are designed specifically for multi-scale pixel expansion. A more balanced approach is used to optimize the information output by the TCM central modules: besides shrinkage followed by expansion, expansion followed by shrinkage is also used, because in addition to the problem of hollows, the integrity of edge information must be considered when dealing with central information. Finally, to give full play to the combined effects of the pixel gradient module and TCM, they are jointly used to recurrently optimize the network structure.

$${\textrm{s}}_{\textrm{g}}^{\prime}=\max \left(\textrm{S}\right)+\min \left(\textrm{S}\right)-{\textrm{s}}_{\textrm{g}}$$
(3)
Fig. 4
figure 4

Pixel gradient refinement module. First, initial saliency maps are obtained by reducing the dimensionality of the TCM output with 1×1 convolution blocks and binarizing it with sigmoid functions. Then, 8 pixel gradient maps are computed for each pixel from its neighborhood using pixel gradient convolution blocks, and these maps are used to expand or shrink the areas of similar pixel values in the initial saliency maps. Expansion is accomplished with expansive convolution blocks, and shrinkage is accomplished by expansion after inverting the colors of the saliency maps (as shown in Formula 3)

where s_g is the gray value of a pixel in the saliency map S, s_g′ is its inverted gray value, max(S) is the maximum value in the saliency map, and min(S) is the minimum value in the saliency map. A sketch of the expansion and shrinkage operations is given below.
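The following minimal sketch shows the inversion, expansion, and shrinkage operations, where a max-pooling pass stands in for the expansive convolution block (an assumption for illustration); shrinkage is implemented, as in Formula (3), by expanding the color-inverted map and inverting back.

```python
import torch
import torch.nn.functional as F

def invert(saliency: torch.Tensor) -> torch.Tensor:
    # Formula (3): flip the gray values of the saliency map
    return saliency.max() + saliency.min() - saliency

def expand(saliency: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    # Grow regions of similar pixel values outwards (stand-in for the expansive block)
    pad = kernel_size // 2
    return F.max_pool2d(saliency, kernel_size, stride=1, padding=pad)

def shrink(saliency: torch.Tensor, kernel_size: int = 3) -> torch.Tensor:
    # Shrinkage = expansion applied to the inverted map, then inverted back
    return invert(expand(invert(saliency), kernel_size))

def refine_edges(saliency: torch.Tensor) -> torch.Tensor:
    # Edge branch: shrinkage first, then expansion, as described in the text
    return expand(shrink(saliency))
```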

3.4 Loss function self-adjustment strategy

As several TCM modules are involved in our method, each requiring its own loss function to guide spatial attention, it is necessary to coordinate the relationships between these loss functions. Moreover, since completely annotated video saliency detection datasets are scarce, overfitting often occurs during training. To solve these two problems, a self-adjustment strategy for the loss functions is proposed.

The loss function self-adjustment strategy first tests the generalization performance of network training, as shown in Fig. 5, and then adjusts the learning rate and the weights of the loss functions of the different modules. In total, 51 binary cross-entropy loss functions are used in this paper, producing 51 results corresponding to 51 losses. If at least 26 of the losses pass the test, the initial learning rate remains unchanged; otherwise, it is reduced to 1/10 of the initial value. At the same time, the weights of the loss functions are adjusted according to the number of losses that pass the test: the more qualified losses a module has, the larger its weight (see the sketch after Fig. 5). The binary cross-entropy loss function is expressed as follows:

$$Loss=-\frac{1}{N}\sum_{i=1}^N\left[{y}_i\bullet \log \left(p\left({y}_i\right)\right)+\left(1-{y}_i\right)\bullet \log \left(1-p\left({y}_i\right)\right)\right]$$
(4)
Fig. 5
figure 5

Flow chart of loss function generalization test
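The following sketch captures the adjustment logic under the assumptions that the generalization test yields a boolean outcome for every loss and that each module's weight is proportional to its number of qualified losses; the function name and interfaces are hypothetical, not the authors' code.

```python
def adjust_training(optimizer, losses_per_module, passed_per_module, base_lr=0.01):
    """losses_per_module: list of lists of loss tensors (51 in total).
    passed_per_module: matching lists of booleans from the generalization test."""
    total_passed = sum(sum(p) for p in passed_per_module)

    # Keep the initial learning rate if at least 26 losses pass, otherwise divide by 10
    lr = base_lr if total_passed >= 26 else base_lr / 10
    for group in optimizer.param_groups:
        group["lr"] = lr

    # Weight each module's loss by its share of qualified losses
    counts = [max(sum(p), 1) for p in passed_per_module]
    norm = sum(counts)
    return sum(c / norm * sum(losses)
               for c, losses in zip(counts, losses_per_module))
```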

4 Experiment

4.1 Datasets

Three commonly used benchmark datasets are used for evaluation in this paper: DAVSOD [16], SegTrackV2 [17], and DAVIS 2017 [18]. DAVSOD (Densely Annotated VSOD) is composed of 226 video sequences showing obvious motion displacements of salient objects and is therefore valuable for evaluating how well motion patterns are captured. SegTrack V2 includes 14 video sequences featuring birds, animals, vehicles, and humans, and is used to evaluate the effectiveness of detection methods under severe motion interference. DAVIS 2017 (Densely Annotated Video Segmentation 2017) is composed of 90 high-quality, full-HD video sequences, 6242 frames in total, and every frame has full pixel-level annotated ground truth. DAVIS 2017 is so far the most comprehensive dataset covering the main challenges of video object segmentation, such as occlusion, motion ambiguity, and appearance changes, and it can evaluate a detection method from all aspects.

4.2 Evaluation criteria

To comprehensively evaluate the performance of the method, we compare it with several popular methods using three indicators: precision-recall (PR) curves, the maximum F-measure [19], and mean absolute error (MAE).

$${\textrm{F}}_{\beta }=\frac{\left(1+{\beta}^2\right) Precision\times \textrm{Recall}}{\beta^2\times Precision+\textrm{Recall}}$$
(5)

β2 weights precision against recall to reflect their relative importance; β2 is set to 0.3, indicating that precision is weighted more heavily. P-R curves and the F-measure evaluate how well the binarized saliency maps match the GT. MAE measures the similarity between the raw saliency map and the GT as the average absolute difference in pixel values. Together, these three indicators demonstrate the value of the proposed method.
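A small sketch of the metric computation, assuming the saliency map and GT are arrays normalized to [0, 1]; the 255-step threshold sweep used for the maximum F-measure is an assumption.

```python
import numpy as np

def mae(saliency: np.ndarray, gt: np.ndarray) -> float:
    # Average absolute per-pixel difference between the saliency map and the GT
    return float(np.mean(np.abs(saliency - gt)))

def max_f_measure(saliency: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    # Sweep binarization thresholds and keep the best F-measure (Formula (5))
    best = 0.0
    for t in np.linspace(0.0, 1.0, 255):
        pred = saliency >= t
        tp = np.logical_and(pred, gt > 0.5).sum()
        precision = tp / (pred.sum() + 1e-8)
        recall = tp / ((gt > 0.5).sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best
```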

Our model is built within the PyTorch framework. The model is trained with stochastic gradient descent (SGD) for 300 epochs with a batch size of 16. A total of 51 binary cross-entropy loss functions are used, producing 51 results corresponding to 51 losses. The initial learning rate is set to 0.01; if at least 26 losses pass the test, the learning rate remains unchanged, otherwise it is reduced to 1/10 of the initial value. All experiments were run on an NVIDIA RTX 1080 GPU. A minimal sketch of the training loop is given below.
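A minimal sketch of this training setup, where `model`, `train_loader` (batch size 16, yielding a frame triple and its GT), and `criterion` (the combined 51-term BCE loss) are assumed to be defined elsewhere; no hyperparameters beyond those stated above are implied.

```python
import torch

def train(model, train_loader, criterion, epochs: int = 300, lr: float = 0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, gt in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(*frames), gt)   # frames = (prev, mid, next)
            loss.backward()
            optimizer.step()
    return model
```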

4.3 Performance Comparison

The method in this paper is compared with other advanced detection methods, including SSAV [20], PDB [21], FGRNE [22], SCOM [23], MBN [24], and LSTI [25]. The superiority of our network model is highlighted in Figs. 6 and 7 and Table 1. For the sake of fairness, the saliency maps used in the comparison come from their respective authors.

Fig. 6
figure 6

P-R Curves in terms of three different databases

Fig. 7
figure 7

Comparison of the results of different video saliency detection methods

Table 1 Comparison of MAE and maximum F-measure of different methods

4.3.1 Quantitative analysis

To quantitatively analyze the performance of the proposed method, three indicators, namely PR curves, maxFβ, and MAE, are used to compare it with other mainstream video saliency detection methods. As indicated by the PR curves in Fig. 6, our method achieves the best precision and recall among these methods on all three datasets. As shown in Table 1, which lists the maxFβ and MAE of all methods, ours is 0.57%, 0.24%, and 0.34% higher than the second-best method in terms of maxFβ on DAVIS 2017, DAVSOD, and SegTrackV2, respectively; in terms of MAE, our method is 0.10%, 0.50%, and 0.19% lower. These consistent gains across three different datasets demonstrate the effectiveness of our method.

4.3.2 Qualitative analysis

Figure 7 compares the visual results of our network model with those of other network models. According to this comparison, the saliency maps produced by our method have sharper edges and more central details, are free of central hollows, and achieve better visual effects than the other methods. Our method produces satisfactory results for different kinds of input images, whether they have low contrast between the background and the object (the first group), high contrast within the object itself (the second group), a complicated background (the third group), or many details in the object (the fourth group).

4.4 Ablation experiment

In this section, ablation experiments are conducted on DAVIS 2017 and SegTrackV2 to confirm the effectiveness of each module of our model.

4.4.1 Effectiveness of GRM

After adding the GRM module, the edge continuity of salient objects is improved. As shown in Table 2, the indicators on both DAVIS 2017 and SegTrackV2 are better than before. However, as Fig. 8 shows, despite the better edge detection performance, hollows remain in the central areas.

Table 2 MAE and maxFβ for the ablation experimental method
Fig. 8
figure 8

Visual effect of Ablation Experiment

4.4.2 Effectiveness of GRM and TCM

In view of the different behavior in the central and edge areas, TCM is used to draw attention to both the central and edge areas, and GRM then treats the two kinds of areas distinctively to ensure edge continuity and avoid hollows in the central areas. As can be seen in Fig. 8, there are far fewer hollows in the central areas after applying TCM.

4.4.3 Effectiveness of REC and Self_Loss

As shown in Table 2, recurrent optimization and the self-adjusting loss functions (Self_Loss) improve the final results. The recurrent optimization strategy (REC) fully exploits the potential of the combination of GRM and TCM, while Self_Loss effectively alleviates the overfitting present in training and helps maintain the stability of the network structure. As shown in Fig. 8, these strategies contribute to better detection performance.

5 Conclusion

This paper proposes a deep learning method for video salient object detection based on temporal difference and pixel gradient. The detection model is mainly composed of a temporal differential co-attention module and a pixel gradient refinement module. The co-attention module aggregates the spatiotemporal information of salient objects, and the pixel gradient refinement module then performs pixel-level optimization of the video frames. A recurrent optimization strategy is applied to improve the precision of the spatiotemporal saliency results. In addition, we propose a loss function self-adjustment strategy to enhance network stability and address problems such as overfitting. Experiments on several publicly available datasets verify that the designed method yields better indicators and visual effects.

The current research demonstrates that the designed co-attention mechanism and pixel gradient refinement module can effectively address the poor continuity of differential features between consecutive frames. However, video salient object detection still faces various challenges brought by continuous motion cues and rich inter-frame information. Thus, the current work does not yet fully solve video saliency detection under complicated scenarios in the wild, which is our future research direction.