1 Introduction

Video saliency prediction is one of the most enduring research problems in computer vision. It aims to identify the most noticeable regions of a dynamic scene. Video saliency prediction models have been widely adopted in a variety of video processing applications, including video compression [1, 2], video captioning [3], and video surveillance [4, 5].

Video saliency prediction methods have traditionally relied on combining visual cues such as intensity, color, motion, and spatial frequency. These cues are woven together into a saliency map that highlights visually salient regions. However, earlier methods had clear limitations: they could not effectively incorporate temporal dependencies and they treated individual pixels indiscriminately, often leading to suboptimal results.

In recent years, the advent of deep learning, combined with the proliferation of high-quality video saliency prediction datasets, has catalyzed significant advances in video saliency prediction and given rise to several model categories, including two-stream models [6,7,8,9], LSTM-based models [10,11,12], and 3D convolutional models [13,14,15]. Two-stream models acquire temporal and spatial information separately and then fuse them into spatio-temporal information to obtain the final saliency map. However, because two-stream models usually extract temporal information from optical flow, they only consider the temporal information between adjacent frames. LSTM-based models alleviate this limitation because LSTMs extend the temporal horizon. Yet, since LSTM-based models process spatial information with convolutional networks and temporal information with LSTMs, they cannot exploit spatial and temporal information jointly, which is important for saliency prediction. To address this issue, Min et al. [13] proposed TASED-Net, a model built on a 3D convolutional network and specifically devised for the joint processing of spatio-temporal information. While 3D convolution-based models have made substantial progress, they still cannot overcome the inherent limitation of local receptive fields.

This is where our work aims to contribute. Since the long- and short-term memory of the human visual system affects visual attention, we take inspiration from the self-attention mechanism [16]. We note that the dot-product attention inside the self-attention mechanism can establish long-range spatio-temporal interactions between features at different time steps. We therefore add an MLIA module and a Transformer module, both built on the self-attention mechanism, at different levels of the model to generate a global spatio-temporal context. We choose the Transformer because its inherent attention mechanism and multilayer perceptron (MLP) structure allow it to split input object features into patch tokens and mine the patch-structure similarity of related objects. This makes the model pay more attention to salient regions while establishing long-range dependencies.

We propose a novel approach, the Transformer-based Multi-level Attention Integration Network (TMAI-Net), designed to collectively address the limitations mentioned above. The encoder of TMAI-Net is the 3D fully convolutional part of S3D [17], pre-trained on the Kinetics dataset [18]. 3D convolutional layers can encode hierarchical spatio-temporal information, covering not only low-level cues such as color contrast but also high-level semantics such as persons. In our model, the 3D CNN encoder extracts four branches at different depths, from shallow to deep, producing features that range from low level to high level. We place the MLIA module and the Transformer at the shallowest and deepest of the four branches, respectively, to construct long-range spatio-temporal interactions through the self-attention mechanism they share. In addition, we use a two-stream input strategy [19], where the model receives the stacked original video frames together with their corresponding attentional patches as input. We use the Transformer to mine the structural similarity of related objects in the extracted global features and attention features, making the model focus more on salient regions. Overall, our main contributions are as follows:

  • We propose a Transformer-based Multi-level Attention Integration Network (TMAI-Net) for video saliency prediction, which introduces the self-attention mechanism to compensate for the limitations of existing 3D CNN-based models.

  • We present a Multi-Level Interactive Attention (MLIA) module to capture long-range spatio-temporal relationships between time steps. The MLIA module utilizes the self-attention mechanism at the pixel level to predict human visual attention.

  • The MLIA module and the Transformer module are placed at the shallowest and deepest levels of the four branches extracted from the model’s encoder, respectively, directly establishing a global spatio-temporal context at multiple levels. In addition, the Transformer mines the structural similarity among related objects in the global and attention features, which helps the model focus more on salient regions.

2 Related work

2.1 Recent video saliency prediction models

Traditional video saliency prediction models combine static and motion information and use hand-designed spatio-temporal features for saliency modeling [20, 21]. Given the swift development of deep learning methodologies and the availability of large dynamically annotated datasets such as DHF1K [11], deep learning-based video saliency prediction models have prevailed, demonstrating notable superiority over conventional models. Bak et al. [6] proposed a two-stream convolutional neural network that takes video frames and corresponding optical flow maps as input, merging spatial and temporal streams to produce saliency maps. Li et al. [22] presented a precise end-to-end learning framework for video saliency prediction, which collects motion information through optical flow and enhances temporal coherence by employing LSTM networks to encode sequence features. Wang et al. [23] proposed ACLNet, which enhances the CNN-LSTM architecture with an attention mechanism, facilitating rapid end-to-end saliency learning and enabling temporal saliency representation across consecutive frames via LSTM. Liu et al. [24] designed a novel saliency detection algorithm that effectively transfers image reconstruction knowledge to the learning process of saliency detection. Liu et al. [25] designed a scene-guided two-branch network for salient object detection that allows cross-task knowledge distillation from scene classification. Liu et al. [26] extended residual pose routing to saliency prediction, which improved computational efficiency while reducing the number of parameters. In addition, some methods introduce the study of part-whole relationships into salient object detection and use multi-flow strategies and confidence scores to improve the model’s ability to segment salient objects [27,28,29]. Moreover, 3D convolutional architectures have been investigated for video saliency prediction. Min et al. [13] proposed TASED-Net, a 3D convolutional encoder-decoder model designed to manage temporal information while extracting spatial features. Furthermore, Xue et al. [19] proposed a 3D convolutional encoder-decoder network named ECANet, which introduces explicit cyclic attention for temporal modeling and pixel emphasis. ViNet is a visual architecture that includes an auditory module to investigate the fusion of audiovisual cues in video saliency prediction [30]. STA3D is a spatio-temporal attention 3D network that selectively propagates temporal saliency features and refines spatial features for video saliency prediction [31]. The video saliency prediction models above are all built on 3D CNNs. However, 3D convolution only produces local receptive fields and ignores long-range feature dependencies.

In our work, the model uses the MLIA and Transformer modules to model long-range spatio-temporal relationships at different levels, directly constructing a global context and enabling interactive attention across space and scale.

2.2 Attention mechanism

Attention mechanisms have recently shown exceptional effectiveness in a variety of computer vision [32, 33] and natural language processing tasks [34, 35]. In the conventional attention mechanism, the input tokens are first transformed into queries, keys, and values at the embedding layer; dot-product attention is then used to compute the long-range relationships among the tokens in the input sequence. In the domain of computer vision, Oh et al. [36] proposed a temporal memory network utilizing an attention mechanism for video object segmentation, effectively capturing long-range dependencies between current and past frames while maintaining memory updates for enhanced performance. Yuan et al. [16] proposed a feature pyramidal interactive attention network for egocentric gaze prediction, which leverages an attention mechanism to facilitate interactive attention across space and scale, effectively capturing long-term relationships among spatio-temporal features at various time steps. Wang et al. [37] proposed STSANet, which models long-range spatio-temporal relations separately for different levels of extracted information and then fuses the spatio-temporal features of different levels to output saliency maps. Zhang et al. [38] designed an attention-guided mechanism that adaptively learns adjacent feature fusion weights to better learn multiscale spatio-temporal features.

The above models use the self-attention mechanism to model long-range spatio-temporal dependencies, which compensates for the limitation that 3D CNNs model only local spatio-temporal features. Moreover, given the different receptive fields and the different richness of semantic information at each level, it is also necessary to consider how global dependencies are modeled at different levels. In our model, in order to better utilize multi-level features, maximize accuracy, and limit model complexity, we place the MLIA module and the Transformer at the shallowest and deepest levels of the four branches extracted by the encoder, respectively, and leverage the self-attention mechanism in both to model long-range spatio-temporal dependencies.

2.3 Visual transformer

The advent of the Transformer architecture has considerably accelerated progress in computer vision and natural language processing (NLP). It has gained widespread adoption in tasks encompassing classification [39,40,41,42], segmentation [43], and detection [44,45,46,47], consistently delivering outstanding performance. In contrast to CNNs, which are limited to modeling local spatio-temporal features, the Transformer has been used to model global spatio-temporal features thanks to its inherent self-attention mechanism. Liu et al. [48] proposed the Video Swin Transformer, a network built on the Swin Transformer [49] that exploits a spatio-temporal locality bias and computes self-attention over spatio-temporal inputs through a shifted-window mechanism. Ma et al. [50] applied the Transformer to video saliency prediction, using it to capture the temporal dependence between input past frames and target future frames, and achieved excellent performance. Wang et al. [51] proposed a saliency detection algorithm for optical remote sensing images that exploits the advantages of both CNNs and Transformers to better extract local and global contextual information in complex scenes. Su et al. [52] proposed a unified Transformer framework for group-based segmentation, through which long-range dependencies between image features are captured and patch-structure similarities between related objects are explored. Zhou et al. [53] used the Video Swin Transformer as a model backbone to generate multi-level spatio-temporal features with rich contextual cues.

In our model, the Transformer is used to mine the structural similarity between the global features and the attention features. Its inherent ability to interact with global features allows it to be combined with our proposed MLIA module to model long-range spatio-temporal dependencies at different information levels, significantly enhancing the model’s performance.

3 Approach

3.1 Overall architecture

TMAI-Net is a two-stream model built on a 3D CNN backbone that uses the Transformer module and the MLIA module to model human visual attention. Our model predicts the saliency map frame by frame by means of a sliding window. For a video with N frames in total, the saliency prediction for any frame is obtained by considering a fixed number of consecutive past frames, denoted T in our work. The input to the model is the original video frames F and the corresponding attentional patches A. In other words, \(S_t\), the saliency map at time step t, is predicted given an input clip \(\left( F_{t-T+1}, \ldots , F_t\right) \) and attentional patches \(\left( A_{t-T+1}, \ldots , A_t\right) \) for any \(t \in \{T, \ldots , N\}\), where \(F_t\) is the frame at time step t and \(A_t\) is the attention patch at time step t. \(S_t\) can be calculated using the following equation:

$$\begin{aligned} S_t=\left\{ \begin{array}{l} \textrm{Q}\left( \left\{ F_{t+T-i}\right\} _{i=1}^T,\left\{ A_{t+T-i}\right\} _{i=1}^T\right) \cdots t \in (0, T) \\ \textrm{Q}\left( \left\{ F_{t-T+i}\right\} _{i=1}^T,\left\{ A_{t-T+i}\right\} _{i=1}^T\right) \cdots t \in [T, N] \end{array}\right. \end{aligned}$$
(1)

where Q represents our TMAI-Net model.
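
As a concrete illustration of the sliding-window formulation in (1), the following sketch shows how frame-level saliency maps could be generated with a trained model; the function name tmai_net and the tensor layout (N, C, H, W) are assumptions for illustration, not the paper's actual interface.

```python
import torch

def predict_saliency_maps(tmai_net, frames, patches, T=32):
    """Sliding-window inference (Eq. 1) over a video of N frames.

    frames, patches: tensors of shape (N, C, H, W); tmai_net is assumed to map a
    clip of T frames plus T attention patches to a single saliency map.
    """
    N = frames.shape[0]
    saliency = []
    for t in range(1, N + 1):                       # 1-indexed frame positions, as in Eq. (1)
        if t < T:                                   # first case: frame t and the following T-1 frames
            clip_f, clip_a = frames[t - 1:t - 1 + T], patches[t - 1:t - 1 + T]
        else:                                       # second case: the T frames ending at frame t
            clip_f, clip_a = frames[t - T:t], patches[t - T:t]
        with torch.no_grad():
            s_t = tmai_net(clip_f.unsqueeze(0), clip_a.unsqueeze(0))  # (1, 1, H, W)
        saliency.append(s_t.squeeze(0))
    return torch.stack(saliency)                    # (N, 1, H, W)
```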

Fig. 1  Detailed structure of the TMAI-Net. The model contains two encoders (a global encoder and an attentional encoder), a decoder, the Multi-level Interactive Attention (MLIA) module, and the Transformer module. The encoders encode the model input to generate multi-level spatio-temporal features corresponding to four branches. MLIA and Transformer are placed at the shallowest and deepest levels of the four branches, respectively, to establish long-range spatio-temporal dependencies across different time steps. The saliency map is then generated by the decoder

The proposed model’s structure is illustrated in Fig. 1. Our model employs the fully convolutional component of the S3D network [17], pre-trained on the Kinetics dataset [18], as its foundational architecture. Within the S3D network, our model encodes multi-level features with 3D CNNs; the four branches output from the backbone correspond to features ranging from low level to high level. At the lowest level of the model (closest to the pixel) we add the Multi-level Interactive Attention (MLIA) module, which is based on a self-attention mechanism and models long-range spatio-temporal relationships. In addition, we add the Transformer module to mine the structural similarity between each input video frame and its corresponding attention patch, which helps the model focus more on salient regions. The Transformer also works together with the MLIA module to model global spatio-temporal features at different levels. Lastly, the features output by the MLIA module and the Transformer module are combined and decoded by the decoder to produce the saliency map.

3.2 Encoders and decoder

The encoder and decoder architecture used in [13] provides the basis of our TMAI-Net. To implement temporal modeling and pixel emphasis, we introduce the explicit cyclic attention mechanism proposed in [19]. Both encoders of TMAI-Net use the S3D network as the basic structure. We then fuse the global features and attention features and feed them into the decoder. The decoder spatially decodes the features while jointly aggregating temporal information to produce saliency maps.

The global encoder of TMAI-Net uses the S3D network as the underlying architecture, removing the final pooling, convolutional and fully connected layers to extract spatio-temporal features from multiple levels. The output \(\left\{ C_g^i\right\} _{i=1}^4\) of the four branches of the global encoder can be calculated using the following formula.

$$\begin{aligned} C_g^i=\left\{ \begin{array}{ll} R\left( F * w_g^i+b_g^i\right) &{} i=1 \\ R\left( C_g^{i-1} * w_g^i+b_g^i\right) &{} i \ne 1 \end{array},\right. \end{aligned}$$
(2)

where \(i \in \{1, 2, 3, 4\}\); \(R(\cdot )\) represents the ReLU activation function, and \(w_g^i\) and \(b_g^i\) represent the weight and bias of the i-th branch, respectively.

Compared with the global encoder, the attentional encoder of TMAI-Net reduces the max-pooling next to and inside the first convolutional layer. For the specific convolutional layers of the S3D network, see [17]. The output \(\left\{ C_a^i\right\} _{i=1}^4\) of the four branches of the attentional encoder can be calculated using the following formula.

$$\begin{aligned} C_a^i=\left\{ \begin{array}{ll} R\left( A * w_a^i+b_a^i\right) &{} i=1 \\ R\left( C_a^{i-1} * w_a^i+b_a^i\right) &{} i \ne 1 \end{array},\right. \end{aligned}$$
(3)

The parameters in the formula for the attentional encoder are analogous to those of the global encoder. At the lowest and highest levels of the four branches extracted from the encoders, we place the MLIA module and the Transformer block, respectively, to establish long-range spatio-temporal interactions, as sketched below.
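
A minimal sketch of how the four-branch feature extraction of (2) and (3) could be organized is given below; it assumes the S3D backbone has already been split into four sequential stages, and the grouping and the class name MultiLevelEncoder are illustrative, not the paper's implementation.

```python
import torch.nn as nn

class MultiLevelEncoder(nn.Module):
    """Sketch of Eqs. (2)-(3): four branches C^1..C^4 produced by stacked 3D-conv stages.

    `stages` is assumed to be the S3D backbone split into four sequential blocks,
    with the final pooling, convolutional, and fully connected layers removed.
    """
    def __init__(self, stages):
        super().__init__()
        assert len(stages) == 4
        self.stages = nn.ModuleList(stages)

    def forward(self, x):                  # x: (B, 3, T, H, W) video clip or attention patches
        branches = []
        for stage in self.stages:          # C^i = stage_i(C^{i-1}), with C^0 being the input
            x = stage(x)
            branches.append(x)
        return branches                    # [C^1, C^2, C^3, C^4], from shallow to deep
```

The same structure can serve both the global encoder and the attentional encoder, differing only in the pooling configuration of the first stage.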

3.3 Multi-level Interactive Attention (MLIA) module

Video saliency prediction entails emulating human visual attention in dynamic scenes, which requires a comprehensive grasp of the contextual information within the video. Such a task requires not only combining multiple levels of semantic information, but also capturing long-range relationships between visual features at different moments. In our model, after the input passes through the backbone, the temporal dimension of the multi-level features is compressed to 4 by the 3D convolutional layers. Given the persistent limitations of existing video saliency prediction models in learning spatio-temporal feature correlations and identifying salient regions, we incorporate the MLIA module at the most granular level of the four branches originating from the backbone. This integration aims to effectively model long-range dependencies among the time steps of the temporal channel at the pixel level.

We propose the Multi-level Interactive Attention (MLIA) module shown in Fig. 2. Our MLIA module draws on the SATA module proposed in [37]. In [37], the SATA module is applied to all four branches of the backbone, enhancing the visual features across different time steps at each level. However, the SATA module cannot be used directly in our approach because we have two feature streams, and it makes little sense to update long-range spatio-temporal relationships for only one of them. MLIA therefore also has to handle the fused auxiliary feature stream.

Fig. 2  The overall structure diagram of the Multi-level Interactive Attention (MLIA) module

Fig. 3  The self-attention sub-layers of the Multi-level Interactive Attention (MLIA) module. (a) Self-attention Layer 1 of MLIA. (b) Self-attention Layer 2 of MLIA

Fig. 4  The details of the dot product attention in the Self-attention layer. SF stands for the softmax operation

As shown in Fig. 2, MLIA contains three sub-modules: a feature stream fusion module and two self-attention sub-layers. The inputs to the MLIA module are the outputs of the lowest level of the TMAI-Net encoders. In our model, the two input feature streams pass through the feature stream fusion module in MLIA and then enter the two self-attention sub-layers, which directly capture the long-range relationships between spatio-temporal features at different time steps through dot-product attention. The input of the MLIA module can be denoted as \( L \in \mathbb {R}^{C \times 4 \times H \times W} \), where C represents the semantic channel, 4 represents the temporal channel, and H and W denote the height and width, respectively. In Self-attention Layer 1, as illustrated in Fig. 3(a), the input feature L is split along the temporal channel into 4 sub-features \( \left\{ L^1, L^2, L^3, L^4\right\} \in \mathbb {R}^{C \times 1 \times H \times W} \). We convert the obtained sub-features into queries, keys, and values and measure their pairwise relationships by dot-product attention. The global dependencies of the spatio-temporal features are then captured by aggregating this relational information. The dot-product attention is depicted in Fig. 4 and computed as follows:

$$\begin{aligned} {\text {DP-Att}}\left( L_q, L_k, L_v\right) ={\text {Softmax}}\left( \left( L_q\right) ^T L_k\right) \left( L_v\right) ^T \end{aligned}$$
(4)

where \({\text {DP-Att}}(\cdot )\) stands for dot-product attention, \(L_q\), \(L_k\), and \(L_v\) represent the query, key, and value obtained from the sub-feature transformation, respectively, and \({\text {Softmax}}(\cdot )\) is the softmax activation function. The updated sub-features are subsequently recombined along the temporal channel. Unlike Self-attention Layer 1, in Self-attention Layer 2, as depicted in Fig. 3(b), the input features are partitioned into two sub-features \(\left\{ L^{12}, L^{34}\right\} \in \mathbb {R}^{C \times 2 \times H \times W}\). The subsequent operation is the same as in Self-attention Layer 1: the long-range dependence of the spatio-temporal features is established by dot-product attention. By computing dot-product attention in the two sub-layers of MLIA, each of the four features \(\left\{ L^1, L^2, L^3, L^4\right\} \) can be updated directly through its long-range spatio-temporal relationships with the other three, as sketched below.
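
The sketch below illustrates the temporal-split dot-product attention of Self-attention Layer 1, treating each temporal slice as one token; the 1×1×1 convolutional projections to queries/keys/values and the scaling by the square root of the token dimension are assumptions added for concreteness and numerical stability, not details taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class TemporalSplitAttention(nn.Module):
    """Illustrative version of MLIA Self-attention Layer 1 (Eq. 4): the four temporal
    slices of L in R^{C x 4 x H x W} attend to one another via dot-product attention."""
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Conv3d(channels, channels, kernel_size=1)   # assumed projections
        self.to_k = nn.Conv3d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):                              # x: (B, C, 4, H, W)
        B, C, T, H, W = x.shape
        def as_tokens(t):                              # one token of size C*H*W per time step
            return t.flatten(3).permute(0, 2, 1, 3).reshape(B, T, -1)
        q, k, v = (as_tokens(proj(x)) for proj in (self.to_q, self.to_k, self.to_v))
        attn = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)  # (B, T, T)
        out = attn @ v                                 # each slice updated by all the others
        return out.reshape(B, T, C, H * W).permute(0, 2, 1, 3).reshape(B, C, T, H, W)
```

Self-attention Layer 2 follows the same pattern with the features grouped into two temporal slices instead of four.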

In addition, the fusion of the two input feature streams needs to be implemented in the MLIA. The fused feature stream \(C_f\) can be computed using the following equation:

$$\begin{aligned} C_f=R\left( \text{ Concate } \left( C_1, C_2\right) * w_f+b_f\right) \end{aligned}$$
(5)

where \(R(\cdot )\) is the ReLU activation function, \(w_f\) and \(b_f\) are the weight and bias of the fusion module, \(Concate(\cdot )\) denotes the concatenation of the two feature streams, and \(C_1\) and \(C_2\) represent the input video stream and the explicit cyclic attention stream, respectively. The MLIA module thus enables the modeling of spatio-temporal features across different time steps, which greatly improves the performance of the model.
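
For completeness, a minimal sketch of the fusion step in (5) is shown below; the kernel size of the fusing 3D convolution is an assumption, since the paper does not specify it.

```python
import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    """Sketch of Eq. (5): concatenate the video stream C_1 and the explicit cyclic
    attention stream C_2 along the semantic channel, then fuse them with a 3D
    convolution followed by a ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv3d(2 * channels, channels, kernel_size=1)   # kernel size assumed

    def forward(self, c1, c2):                       # both (B, C, 4, H, W)
        return torch.relu(self.fuse(torch.cat([c1, c2], dim=1)))
```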

3.4 Transformer block

The Transformer [32] has garnered substantial interest within the realm of computer vision due to its proven ability to capture long-range relationships. In our model, we include the Transformer block at the deepest level of the four branches drawn from the encoder, which helps to leverage the structural similarities between related objects in the attention and global features and thus better achieve pixel emphasis. The Transformer is able to capture long-range dependencies of spatio-temporal features thanks to its self-attention mechanism. Together with MLIA, it models long-range spatio-temporal relationships at different levels, which significantly improves the model performance.

Fig. 5  Transformer-based feature interaction module

Our Transformer block is shown in Fig. 5. Conventional Transformer models use an encoder-decoder architecture with stacked self-attention layers and point-wise, fully connected layers in both the encoder and the decoder. The encoder in our module consists of a stack of M identical layers, each with two sub-layers. The first sub-layer is a multi-head self-attention mechanism (MSA), while the second is a fully connected feed-forward network (FFN) implemented as a multi-layer perceptron (MLP). Both sub-layers are wrapped with residual connections and layer normalization (LN). For the input token sequence \(X \in \mathbb {R}^{C \times L}\), the Transformer processes it through position encoding and the two sub-layers of the encoder, as follows:

$$\begin{aligned} x_0=\left[ u_1+p_1, u_2+p_2, \ldots , u_L+p_L\right] \end{aligned}$$
(6)
$$\begin{aligned} x_m^{\prime }={\text {MSA}}\left( {\text {LN}}\left( x_{m-1}\right) \right) +x_{m-1} \end{aligned}$$
(7)
$$\begin{aligned} x_m={\text {MLP}}\left( {\text {LN}}\left( x_m^{\prime }\right) \right) +x_m^{\prime } \end{aligned}$$
(8)
$$\begin{aligned} y=L N\left( x_m\right) \end{aligned}$$
(9)

where \(p_i\) represents the position embedding for input token \(u_i\) and y refers to the output. M represents the number of encoder layers and \(m \in \{1, \ldots , M\}\). MSA builds on the self-attention mechanism, which maps queries and sets of key-value pairs to outputs. Within the MSA, the attention function is computed in parallel over a set of queries packed into a matrix Q; similarly, the keys and values are packed into matrices K and V. The MSA is calculated as follows:

$$\begin{aligned} {\text {Self}}-{\text {Attention}}(Q, K, V)={\text {softmax}}\left( \frac{Q K^T}{\sqrt{d_k}}\right) V \end{aligned}$$
(10)
$$\begin{aligned} H_i={\text {Self}}-{\text {Attention}}\left( Q W_i^Q, K W_i^K, V W_i^V\right) \end{aligned}$$
(11)
$$\begin{aligned} {\text {MSA}}(Q, K, V)={\text {Concat}}\left( H_1, \ldots , H_h\right) W^O \end{aligned}$$
(12)

where the parameter matrices are \(W_i^Q \in \mathbb {R}^{d_{\text{ model } } \times d_k}\), \(W_i^K \in \mathbb {R}^{d_{\text{ model } } \times d_k}\), \(W_i^V \in \mathbb {R}^{d_{\text{ model } } \times d_v}\), and \(W^O \in \mathbb {R}^{h d_v \times d_{\text{ model } }}\); \(d_k\) is the dimension of the queries and keys, \(d_v\) is the dimension of the values, and \( d_{\text{ model } } \) is the dimensionality of the outputs of all sub-layers and embedding layers in the model. The decoder consists of M identical layers stacked like the encoder; in contrast to the encoder, it adds a third sub-layer that performs multi-head attention over the output of the encoder stack. Through the attention mechanism, the Transformer effectively captures long-range dependencies and alleviates the computational burden of calculating self-attention relationships for large image and video inputs.
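
The pre-norm encoder layer described by (7) and (8) can be sketched as follows; the number of heads, the MLP hidden width, and the GELU activation inside the MLP are placeholders rather than the paper's settings, and PyTorch's nn.MultiheadAttention stands in for the MSA of (10)-(12).

```python
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Sketch of Eqs. (7)-(8): pre-norm multi-head self-attention followed by a
    pre-norm MLP, each wrapped in a residual connection."""
    def __init__(self, d_model, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.msa = nn.MultiheadAttention(d_model, num_heads)     # MSA, Eqs. (10)-(12)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                                # FFN / MLP sub-layer
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x):                         # x: (L, B, d_model) token sequence
        h = self.ln1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]   # Eq. (7)
        x = x + self.mlp(self.ln2(x))                       # Eq. (8)
        return x
```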

In our model, both encoders use the S3D network as the basic structure and generate four branches \(\left\{ C_g^i\right\} _{i=1}^4\) and \(\left\{ C_a^i\right\} _{i=1}^4\), respectively. We fuse \(C_g^4\) and \(C_a^4\) from the fourth branch of the two encoders into \(C_f \in \mathbb {R}^{B \times C \times 4 \times H \times W}\), where B represents the batch size and C, 4, H, and W stand for the semantic channel, temporal channel, height, and width, in that order. Concretely, we first reshape \(C_f\) into a sequence of flattened tokens \(C_t \in \mathbb {R}^{U \times B \times C}\), where \(U=4 * H * W\). All these tokens \(C_t\) are then fed into the Transformer block.

$$\begin{aligned} C_y= \text{ Transformer } \left( C_t\right) \end{aligned}$$
(13)

where \(C_y\) represents the output of the Transformer block. Afterward, we reshape \(C_y\) back into a feature map \(C_y^{\prime } \in \mathbb {R}^{B \times C \times 4 \times H \times W}\), as sketched below.
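
A minimal sketch of the flatten/reshape procedure around (13) is given below; `transformer` is assumed to be any module mapping a (U, B, C) token sequence to a sequence of the same shape, such as a stack of the encoder layers sketched above.

```python
def transformer_on_fused_features(transformer, c_f):
    """Sketch of Eq. (13): flatten C_f in R^{B x C x 4 x H x W} into U = 4*H*W tokens
    of dimension C, pass them through the Transformer block, and reshape the result
    back to the original feature-map layout."""
    B, C, T, H, W = c_f.shape
    tokens = c_f.flatten(2).permute(2, 0, 1)            # C_t: (U, B, C) with U = T*H*W
    out = transformer(tokens)                           # C_y = Transformer(C_t)
    return out.permute(1, 2, 0).reshape(B, C, T, H, W)  # C_y' back to (B, C, 4, H, W)
```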

4 Experiments

4.1 Datasets

DHF1K [23]: DHF1K is a comprehensive dataset tailored for dynamic free-viewing gaze prediction. It consists of 1,000 videos at a resolution of \(640 \times 360\) that feature a wide variety of scenes, actions, and activities. Each video was viewed by 17 observers. The 1,000 videos are divided into a training set of 600 videos, a validation set of 100 videos, and a test set of 300 videos.

Hollywood-2 [54]: Hollywood-2 stands as one of the most extensive and demanding datasets in this domain. It contains 1,707 videos featuring human actions in Hollywood movies, with ground-truth saliency maps labeled from the fixations of 19 viewers. These videos cover a wide range of categories, including activities such as phone calls, driving, exiting a vehicle, handshakes, and running. The dataset is divided into a training set with 823 sequences and a test set containing 884 sequences.

UCFSports [54]: UCFSports contains 150 videos of various sports action categories such as diving, golf swings, weightlifting, horseback riding, and walking. Annotations for these videos were gathered through a task-driven approach. After splitting, 103 videos were assigned to the training set and 47 videos to the test set.

4.2 Metrics

To ensure a precise and equitable assessment of the model’s performance, we employ five widely recognized metrics, in line with established prior research [13, 19, 54]: NSS, SIM, CC, AUC-J, and s-AUC. Specifically, NSS is a simple correspondence measure between the predicted saliency map and the ground-truth fixation map, estimating the agreement between the predicted result and the true gaze map. CC, also known as the linear correlation coefficient, assesses the degree of correlation between the predicted saliency map and the fixation map. SIM measures the similarity between two distributions. AUC-J and s-AUC are both variants of AUC; both are computed using binary maps of gaze points and are commonly used for assessing saliency maps. Higher scores on all of these metrics indicate better performance. Common formulations of NSS and CC are sketched below.
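
These are the standard benchmark definitions, not code from the paper; single-image tensors with at least one fixated pixel are assumed.

```python
import torch

def nss(pred, fixations):
    """Normalized Scanpath Saliency: mean of the z-scored prediction at fixated pixels.
    pred: (H, W) saliency map; fixations: (H, W) binary fixation map."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    return p[fixations > 0].mean()

def cc(pred, gt):
    """Linear Correlation Coefficient between the predicted and ground-truth saliency maps."""
    p, g = pred - pred.mean(), gt - gt.mean()
    return (p * g).sum() / (torch.sqrt((p * p).sum() * (g * g).sum()) + 1e-8)
```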

4.3 Experimental setup

TMAI-Net is implemented in Python using the PyTorch [55] framework. For the model input, we set the sliding-window size T to 32. Every input frame has a size of \(224 \times 384\), and the attention patches that match the input frames have a size of \(28 \times 48\). The entire model is trained with the SGD optimizer. We set the learning rate to 0.001 and use an early-stopping strategy on the validation set to prevent overfitting during training. The total number of training iterations is set to 4000, and the batch size is set to 30.
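
The training configuration described above can be summarized in the following sketch; `model`, `train_loader`, and `criterion` are assumed to be defined elsewhere (the criterion being the KLD loss of this section), and early stopping on the validation set is omitted for brevity.

```python
import torch

# Minimal training-loop sketch under the stated hyper-parameters; names are illustrative.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for step, (frames, patches, gt) in enumerate(train_loader):   # batches of size 30
    pred = model(frames, patches)
    loss = criterion(pred, gt)        # KLD loss, Eq. (14)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step >= 4000:                  # 4000 training iterations in total
        break
```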

First, we train our model on the DHF1K training set. Because the annotations of the DHF1K test set are withheld, the model’s results on the test set are benchmarked online. We then train the model on Hollywood-2 and UCFSports, respectively. Since DHF1K contains more types of objects and scenes than Hollywood-2 and UCFSports, it has advantages in diversity, scalability, and generality, so we mainly evaluate the model on DHF1K and use the remaining two datasets as supplements.

In video saliency prediction tasks, the Kullback-Leibler divergence (KLD) [7] has demonstrated its efficacy as a loss function and is extensively employed in the field [13, 19, 54, 56]. The KL loss is computed as follows:

$$\begin{aligned} K L(S, G)=\sum _i G_i \log \left( \epsilon +\frac{G_i}{\epsilon +S_i}\right) \end{aligned}$$
(14)

where S denotes the predicted saliency map, G the ground truth, and \(\epsilon \) a regularization constant. A sketch of this loss is given below.
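
The normalization of both maps to probability distributions before taking the divergence is an assumption that follows common practice for this loss.

```python
import torch

def kl_loss(pred, gt, eps=1e-7):
    """Eq. (14): KL divergence between the ground-truth map G and the predicted
    saliency map S, with both maps normalized to sum to one (assumed)."""
    pred = pred / (pred.sum() + eps)
    gt = gt / (gt.sum() + eps)
    return (gt * torch.log(eps + gt / (pred + eps))).sum()
```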

4.4 Comparison with the state-of-the-art models

We perform a quantitative comparison of our TMAI-Net against 12 state-of-the-art video saliency prediction models: STSConvNet [6], SALICON [7], OM-CNN [10], ACLNet [11], STRA-Net [56], TASED-Net [13], SALSAC [12], UNISAL [15], ECANet [19], ViNet [30], STSANet [37], and TMFI-Net [53]. This evaluation is carried out on the DHF1K [23] dataset as well as the test sets of Hollywood-2 [54] and UCFSports [54]. We selected the aforementioned models after analyzing their structure. TASED-Net was chosen because we adopt it as the basic structure of TMAI-Net. ECANet was chosen because we use its two-stream input approach based on explicit cyclic attention. STSConvNet, OM-CNN, and SALICON were chosen because they are two-stream structures like TMAI-Net. STSANet and TMFI-Net were chosen because they are currently the two best-performing models in the field of video saliency prediction. The remaining models are also current state-of-the-art video saliency prediction models. By comparing against the above models, we can fairly, comprehensively, and accurately assess the efficacy of TMAI-Net.

Table 1 Results of a quantitative comparison between TMAI-Net and other state-of-the-art models using the DHF1K, Hollywood-2, and UCFSports test sets
Table 2 Model size (MB) and average prediction time comparison of video saliency methods

Quantitative. Table 1 presents the quantitative results for these models across the five metrics. Compared against the other 12 state-of-the-art models, TMAI-Net ranks in the top 3 on 4 of the 5 metrics, suggesting that it is competitive on the DHF1K dataset. Although our model is not currently the best performing in the field of video saliency prediction, it far outperforms the most advanced models in terms of computational efficiency (model size and running speed); Table 2 reports the corresponding experiments. The results on the Hollywood-2 and UCFSports datasets show that TMAI-Net outperforms both TASED-Net and ECANet across all metrics, which further illustrates the value of our model.

To assess the model’s performance on the Hollywood-2 and UCFSports datasets more fairly and accurately, some additional adjustments are needed. Our model simulates the behavior of the human visual attention mechanism in modeling virtual memories, which requires sufficient temporal information. However, a few samples within these two datasets lack an adequate number of frames for our model to effectively model virtual memories, and certain samples in Hollywood-2 and UCFSports consist of only one or two images, which leads to results below the anticipated performance level. To address this issue, we selected samples from both datasets with more than 64 frames and re-evaluated our model, so that it has enough frames to learn saliency patterns and model virtual memories in human visual scenes. The results of this re-evaluation are shown in Table 1 under “TMAIlong”. They demonstrate that, on both the Hollywood-2 and UCFSports datasets, our model ranks in the top 3 on a wide range of metrics.

Qualitative. In Fig. 6, we have selected some representative advanced video saliency prediction models for qualitative comparison with our proposed model, including STRA-Net [56], TASED-Net [13], and ViNet [30]. We use the DHF1K dataset’s validation set to qualitatively evaluate our model. The aforementioned procedures are carried out on identical hardware setups to guarantee the impartiality of the evaluation.

Fig. 6  Qualitative comparisons on several video categories, each sampling three frames for display, between the TMAI-Net and other state-of-the-art video saliency models

The experimental results show that our model makes more accurate predictions than the other models. In Fig. 6(a), a woman is pushing a floating ball with a child into the pool. Due to the interaction between the woman and the ball, attention transitions from the woman to the floating ball, and the ground-truth maps record this attention shift. Among all models, only TMAI-Net and TASED-Net predicted the attentional shift from the woman to the floating ball, and TMAI-Net’s prediction is the more accurate of the two. In Fig. 6(b), a man is cutting a stone with a machine. As he moves the tray to cut the stone, the ground-truth map shows that the viewer’s attention shifts from the machine to the man; TMAI-Net provides the most accurate prediction. In Fig. 6(c), a large dog is walking toward a small dog, and as time passes, the ground-truth map shows the viewer’s attention moving from the large dog to the small dog. Among all the models, TMAI-Net’s prediction is closest to the ground truth. In Fig. 6(d), two men are fishing and one of them casts his rod into the water. As time passes, attention transitions from the men to the rod and subsequently to the fish. Only TMAI-Net accurately predicts this process: ViNet ignores the attention transfer, while TASED-Net and STRA-Net pay unnecessary attention to the fishing boat.

Computational load. To assess computational efficiency, we compare our model to a number of state-of-the-art models, including ACLNet [11], STRA-Net [56], TASED-Net [13], SalSAC [12], UNISAL [15], ViNet [30], STSANet [37], and TMFI-Net [53]. Table 2 shows that our model is competitive. In addition, our model uses a two-stream input; for a \(224 \times 384\) video frame and its corresponding \(28 \times 48\) attention patch, it takes about 0.018 s. The computational efficiency of our model is significantly better than that of the two most advanced video saliency prediction models, STSANet and TMFI-Net.

4.5 Ablation studies

In this section, we use the DHF1K dataset to conduct an ablation analysis of TMAI-Net. First, several variants of the model were built in order to examine the contribution of each component more thoroughly. Specifically, we consider three settings: “Two-stream”, “Two-stream+MLIA”, and “Two-stream+Transformer”. “Two-stream” means that a two-stream network is constructed using the 3D backbone as a baseline. “Two-stream+MLIA” means adding the MLIA module to “Two-stream”. “Two-stream+Transformer” means adding the Transformer module to “Two-stream”. Table 3 displays the results of the ablation experiments, where “Ours” refers to our full model, TMAI-Net.

The Contributions of the MLIA module. The difference in results between “Two-stream” and “Two-stream+MLIA” in Table 3 shows that the addition of the MLIA to “Two-stream” improves the performance of four (AUC-J, s-AUC, CC, NSS) of the five metrics. This demonstrates that the MLIA can indeed bring performance improvement to our model.

The Contributions of the Transformer module. Comparing the performance of “Two-stream” and “Two-stream+Transformer”, we can see that three (AUC-J, CC, NSS) of the five metrics improve when the Transformer module is added to “Two-stream”. This finding implies that the Transformer module does bring performance gains.

The Contributions of the MLIA module and Transformer module. Comparing the performance of “Ours” and “Two-stream+MLIA”, we can see that four (SIM, s-AUC, CC, NSS) of the five metrics have improved when the Transformer module is added to “Two-stream+MLIA”. In addition, comparing the performance of “Ours” and “Two-stream+Transformer”, we can see that four (SIM, s-AUC, CC, NSS) of the five metrics have performance gains when the MLIA is added to “Two-stream+Transformer”. This proves that both the MLIA and the Transformer module bring performance gains to the model.

Table 3 Ablation study on the MLIA and Transformer modules

Ablation Study on MLIA. Several variants of the MLIA module were created to further validate its contribution to the model. In our model, the MLIA module is placed at the lowest level (closest to the pixel) of the four branches extracted from the encoder to establish long-range spatio-temporal dependencies at the pixel level. To study and demonstrate the importance of pixel-level spatio-temporal dependencies, we designed three settings, as shown in Table 4: “Setting 1”, “Setting 2”, and “Setting 3”, which represent placing the MLIA module at levels 2, 3, and 4 of the four branches, respectively. “Setting 4” stands for our placement of the MLIA module. The experimental results show that the metrics of “Setting 4” are significantly better than those of the other MLIA variants. This demonstrates the superiority of placing the MLIA module at the lowest level of the model to establish long-range spatio-temporal relationships at the pixel level.

Table 4 Ablation study on MLIA module

The visual comparative analyses in ablation studies. To illustrate the effectiveness of each component of the model more intuitively, we conducted a visual comparative analysis of each TMAI-Net variant designed in the ablation experiments; the results are shown in Fig. 7. They show that TMAI-Net is more accurate than the other variants. As shown in Fig. 7(a) and (b), in frame 3 of (a) only TMAI-Net correctly indicates that the salient region is on the big dog’s head rather than its belly, and in (b) all variants except TMAI-Net incorrectly place the salient region on the fishing boat. Similar differences are observed in (c) and (d).

Fig. 7  The visual comparative analyses in ablation studies

4.6 Failure cases and analyses

While our model generally demonstrates strong performance, there are instances where it encounters challenges. Here, we present a few scenarios that TMAI-Net was unable to handle, along with an analysis of how they might be resolved.

The failure cases illustrated in Fig. 8 highlight instances in which our model did not yield satisfactory predictions. 1) The salient area is very similar to the background in color or texture. In Fig. 8(a), since the swimming fish and the deep sea’s blue backdrop closely resemble each other, it is challenging for TMAI-Net to determine the exact salient region. 2) Accurately predicting saliency for human postures. In Fig. 8(b), TMAI-Net cannot accurately predict the viewer’s area of attention on the dancer’s limbs or head while watching the dance. 3) Blurred salient areas in the shot. In Fig. 8(c), a man is teasing a dog with a dog teaser. When the teaser is thrown, the ground-truth map shows that the viewer’s attention is drawn to the distant teaser, and then to the far-off dog as it goes to pick up the faraway stick. However, TMAI-Net’s prediction draws the attention to the man in the foreground. 4) Many moving objects in the video. In Fig. 8(d), there are many moving people in the scene. The person whose motion is most obvious draws the attention of the human eye, but the model’s predictions are scattered across the others. The large number of moving objects makes it difficult for the model to generate accurate predictions.

By analyzing the above failure cases, we found that these scenarios are not common in the dataset. We can add more similar cases to the training set to improve the accuracy of the model in these scenarios. In addition, for cases (a) and (b), there exist specialized research directions in object detection, namely video camouflaged object detection and human pose estimation. To improve the accuracy in cases (a) and (b), we could incorporate solutions from these two directions into our model in the future. For cases (c) and (d), the complex and changing scenes and the multiple moving objects make it difficult for the model to accurately localize the salient region. This is because TMAI-Net can only infer saliency from a video clip; it cannot understand the contextual information in the video as humans do, and it is difficult for the model to capture the interactions between moving objects. How to make the model understand the contextual information conveyed by the video more accurately is a worthwhile problem for our future work.

Fig. 8  Several cases of our method failing on the DHF1K dataset

5 Conclusion

In this paper, a novel Transformer-based Multi-level Attention Integration Network (TMAI-Net) for video saliency prediction is proposed. We propose a Multi-level Interactive Attention (MLIA) module that captures long-range dependencies among the time steps of the temporal channel. In the MLIA module, long-range spatio-temporal dependencies are updated by computing relationships between different time steps using the self-attention mechanism. Leveraging the Transformer’s inherent global feature interaction capability, we add MLIA and the Transformer to the shallowest and deepest of the four branches extracted from the encoder, directly establishing a global context at different levels. Furthermore, for the model’s two-stream input, the Transformer module mines the structural similarity between related objects and makes the model focus more on salient regions in the video. Comprehensive experiments have demonstrated that TMAI-Net is competitive with current state-of-the-art approaches, and the ablation results offer convincing evidence of the effectiveness of each TMAI-Net component.