1 Introduction

When viewing a scene, humans usually focus on a few salient regions. This is a very useful mechanism of the human brain, but it is hard to simulate with a computer system. The area of saliency detection on static images or dynamic videos has been receiving more and more attention over the past few years (major achievements can be found in [2, 3]). Properly generated saliency maps can be useful for many applications, such as object localization, detection and recognition, and image or video compression. In this paper we focus on video saliency detection.

The saliency information of a video exists not only within each separate frame but also between successive frames. In fact, the latter can be even more important in many scenarios, so we cannot simply take the image saliency of each frame as the final video saliency. To solve this problem, most existing video saliency detection models are based on a spatiotemporal mechanism: they detect spatial saliency on each single video frame and temporal saliency from inter-frame distinctiveness, and the final saliency map is generated by fusing the spatial and temporal saliency maps together. Kim et al. [14] detected spatial saliency with edge and color orientation information and temporal saliency with absolute inter-frame distinctiveness, both based on a center-surround framework; the final saliency is generated by linearly combining the spatial and temporal saliency with a fixed weight for each. Instead of detecting saliency for all frames, Tapu et al. [24] detected saliency only for key frames extracted from short segments of the video. Their spatial saliency is calculated from regional color information, the temporal saliency is calculated by detecting corresponding interest points between each key frame and its adjacent frames, and the final saliency is obtained by using motion contrast to combine the two. Li et al. [19] detected spatial saliency by computing color information of edge-preserving super-pixels, which are extracted with an advanced version of Turbopixels [18]. For temporal saliency, they applied the same mechanism to the optical-flow information of the video. The spatial and temporal saliency are then fed into a conditional random field [17] to label each pixel. Li et al. [20] detected regional saliency for video frames. To segment each frame properly, a fast mean-shift process is performed on the spatial and temporal features computed from color, texture and optical-flow information. The regional saliency is calculated by measuring the dissimilarity of each region with its neighboring regions. Afterward, these dynamic regions are matched in the temporal domain to construct a temporally coherent regional saliency map. Rudoy et al. [23] detected video saliency with semantic information in addition to spatial and temporal information, and computed saliency only for some selected candidate gaze locations, because saliency in video is very sparse according to their observation. Kim et al. [15] proposed a unified spatiotemporal saliency detection framework for both images and video based on textural contrast and motion stimuli. As this model focuses on contrast visual stimuli, which greatly eliminates unwanted details, it performs well even in complex scenes with highly textured backgrounds. Fang et al. [7] measured spatial saliency by extracting intensity, color and texture features from DCT coefficients, detected temporal saliency using motion features in the compressed domain, and designed a new fusion method to obtain the final saliency maps.

One main issue of the spatiotemporal mechanism is computational redundancy. Most frames in a video are highly correlated with their neighbors, so saliency in video is very sparse [20], and computing saliency independently on every pixel of each frame is redundant. To address this issue, some researchers did not follow the above framework. Rapantzikos et al. [22] took the video as a spatiotemporal volume and detected video saliency by measuring local contrast for each visual unit. Duncan et al. [5, 6] built a model based on information entropy theory, namely that geometrically organized regions have lower entropy than disorganized regions. Using weighted Parzen windows, they obtained the Renyi entropy of a probabilistic relational distribution based on distance and gradient-direction relationships between pixels. This model emphasized biological plausibility in the saliency detection process, but neglected the usefulness of color information.

Another, even bigger issue of the spatiotemporal mechanism is accuracy. As we know, the human visual system is more sensitive to motion information than to other cues, so when building spatiotemporal saliency models, most researchers give the temporal saliency map a larger weight in the fusion process. But when the scene is static or has no significant motion, visual attention is attracted by spatial information, and the spatial saliency should be given the larger weight. The problem is that, no matter which saliency is more important, we actually have no idea what the weight value should be. To solve this problem, researchers would need a proper mechanism to evaluate how powerful the motion contrast is. Although it is common sense that a moving target draws people's attention, this can actually be wrong in some cases. For example, leaves on a tree may move very fast on a windy day, but people usually still pay more attention to a beautiful bird on the trunk even if it is static. Sometimes a static object draws more attention precisely because everything else is moving faster, like a walking man in the middle of a busy road. Thus, whether a fixed or a dynamic weight is used to fuse the spatial and temporal saliency, the mechanism will still lose accuracy due to the complexity of motion.

To overcome the above issues, we first offer another point of view on video saliency detection: distinctiveness. Let us assume that people pay attention to distinctiveness while viewing a scene. This is also understandable as common sense. When the human visual system is attracted by a moving object, we are actually attracted by the distinctiveness between frames, which is caused by the moving object. The same holds for the distinctiveness within a single frame, which is caused by color, shape or other features. That is to say, we can use distinctiveness to measure the saliency of each pixel: the larger the distinctiveness, the higher the saliency value, and vice versa, regardless of whether the distinctiveness is caused by spatial or temporal information.

In this paper, we propose a novel video saliency detection model which regards the input video as three-dimensional data. Instead of using the input video directly as Duncan et al. did in [5, 6], the video is first decomposed by the 3D discrete shearlet transform to obtain a multi-scale description. The reason for using a shearlet-based decomposition is to provide multi-scale analysis for saliency detection. The shearlet transform was originally introduced by Guo et al. [10]. It was derived within the composite wavelet framework, which makes shearlets a truly multivariate extension of wavelets. The use of shearing to control directional selectivity allows a single generator, or a finite set of generators, to define shearlet systems. Although directional multi-scale systems emerged years ago, only recently have these representations been extended beyond dimension 2. The extension of shearlets from 2-D to 3-D makes it possible for the shearlet transform to analyze and process 3-D data sets such as video. The 3-D shearlet representation is a multi-scale pyramid of well-localized waveforms defined at various locations and orientations. It was introduced to overcome the limitations of traditional multi-scale systems when dealing with multi-dimensional data, and it has useful properties such as parabolic scaling, directional sensitivity and spatial localization. These properties are useful for saliency detection when describing the video frames at multiple scales and distinguishing regions from their surroundings.

Instead of combining information from the two-dimensional spatial domain and the one-dimensional temporal domain, the proposed model is built on information from three-dimensional blocks. We only need to deal with each video frame once, while the spatiotemporal mechanism processes it at least twice. Most existing spatiotemporal saliency models use motion information between two frames, which causes the loss of long-term motion information. When viewing a scene, a number of successive frames influence the human visual system, which is different from viewing single images independently. Taking this into consideration, we go one step further by processing the video segment by segment: every frame is processed according to the feature maps of the segment it belongs to. As the proposed model can process more information, the detection results are more accurate, as the experiments later in this paper will show. Moreover, it is no longer necessary to calculate meaningful weights as in the fusion step of the spatiotemporal mechanism.

2 The proposed saliency detection model

The proposed model detects video saliency based on the 3D discrete shearlet transform. Instead of using the RGB color space for video saliency detection, all the video frames are converted to the Lab color space. Then each color channel of the converted video is decomposed with the 3D discrete shearlet transform (3-D DShT) as presented in [21]. After de-noising the obtained shearlet coefficient matrices, feature blocks are generated by performing the inverse shearlet transform on each decomposition level. On each feature block, global contrast is used to calculate saliency values, yielding a saliency block for the corresponding level. By fusing all the saliency blocks together, the final saliency value is calculated for each pixel, which gives the saliency map of each video frame. Figure 1 illustrates the overview of this saliency detection framework.

Fig. 1 Overview of the proposed saliency detection framework

As shown in Fig. 1, there are three main steps to generate saliency maps for an input video. The first step is to convert all the video frames from the RGB color space to the Lab color space. In the Lab color space, L is the intensity channel, while a and b are the RG and BY opponent channels, respectively. The reason for this conversion is that the Lab color space is closer to human visual perception, which benefits the saliency evaluation mechanism.
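As an illustration, the color conversion step can be sketched as follows. This is only a minimal sketch: the paper does not specify the conversion routine, so the use of OpenCV and the helper name `to_lab` are assumptions.

```python
import cv2
import numpy as np

def to_lab(frames_bgr):
    """Convert a sequence of BGR video frames to the Lab color space.

    frames_bgr: iterable of HxWx3 uint8 arrays (as read by cv2.VideoCapture).
    Returns an array of shape (T, H, W, 3) with float32 L, a, b channels.
    """
    lab_frames = []
    for frame in frames_bgr:
        # OpenCV reads frames in BGR order; convert each frame to Lab.
        lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
        lab_frames.append(lab.astype(np.float32))
    return np.stack(lab_frames, axis=0)
```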

The second step is to generate feature maps. First, all the converted video frames are resized to m × m. The video is then segmented into blocks of n frames each; the last frame of the last block is replicated to round the block up if needed. After that, by performing the 3-D DShT on a resized video block \(v_0\), the coefficients are obtained:

$$ \left\{{H}_1^c,{H}_2^c,\dots, {H}_M^c,{L}^c\right\}=SH\left({v_0}^c\right) $$
(1)

where c ∈ {L, a, b} denotes the channel of the input video block \(v_0\), \(H_i^c\) is the shearlet coefficient matrix of the i-th level on channel c, M is the maximum decomposition level, \(L^c\) is the scaling coefficient matrix on channel c, and SH denotes the 3-D DShT. The coefficient matrices of different levels have different sizes. For example, when m = 192, n = 96 and M = 3, the size of \(H_1^c\) is 192 × 192 × 96, \(H_2^c\) is 128 × 128 × 64, \(H_3^c\) is 64 × 64 × 32, and \(L^c\) is 32 × 32 × 16.
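A sketch of the block preparation preceding (1) is given below. The resizing and segmentation follow the description above; the 3-D DShT itself is only available through dedicated toolboxes (e.g., the implementation referenced in [21]), so the call `shearlet_decompose_3d` is a hypothetical placeholder standing in for SH(·), and `prepare_blocks` is an assumed helper name.

```python
import cv2
import numpy as np

def prepare_blocks(lab_video, m=192, n=96):
    """Resize frames to m x m and split the video into n-frame blocks.

    lab_video: array of shape (T, H, W, 3) in the Lab color space.
    Returns an array of shape (num_blocks, n, m, m, 3).
    """
    resized = np.stack([cv2.resize(f, (m, m)) for f in lab_video], axis=0)
    T = resized.shape[0]
    pad = (-T) % n   # number of copies of the last frame needed to round up
    if pad:
        resized = np.concatenate(
            [resized, np.repeat(resized[-1:], pad, axis=0)], axis=0)
    return resized.reshape(-1, n, m, m, 3)

# Hypothetical per-channel decomposition of one block, standing in for Eq. (1):
#   H_list, L = shearlet_decompose_3d(block[..., c], levels=M)   # H_list = [H_1, ..., H_M]
```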

The shearlet coefficients corresponding to salient regions are larger than the others. For de-noising, we remove non-salient coefficients by setting all shearlet coefficients whose magnitude is below a proper threshold to 0, and we enhance saliency by multiplying the remaining coefficients by a factor w. The formula is:

$$ H_{i,j}^{\prime}=\begin{cases} wH_{i,j} & \left|H_{i,j}\right|\ge \delta \\ 0 & \left|H_{i,j}\right|<\delta \end{cases} $$
(2)

where δ is the threshold obtained by the VisuShrink threshold function [4].
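A minimal NumPy sketch of the coefficient shrinking in (2) is shown below. The universal (VisuShrink) threshold \(\delta = \hat{\sigma}\sqrt{2\ln N}\) with a median-based noise estimate is the standard formulation of [4]; the amplification factor w is not fixed in the paper, so the default value here is an assumption, and the function name is ours.

```python
import numpy as np

def shrink_coefficients(H, w=2.0):
    """Hard-threshold a shearlet coefficient matrix as in Eq. (2).

    Coefficients below the VisuShrink threshold are set to zero;
    the remaining coefficients are amplified by a factor w (assumed default).
    """
    # Universal (VisuShrink) threshold: sigma * sqrt(2 * ln(N)), with sigma
    # estimated from the median absolute coefficient.
    sigma = np.median(np.abs(H)) / 0.6745
    delta = sigma * np.sqrt(2.0 * np.log(H.size))
    return np.where(np.abs(H) >= delta, w * H, 0.0)
```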

The shearlet coefficient matrix at each level represents the detail information of the video at that level, and the scaling coefficient matrix represents the approximation information at the coarsest resolution. To create the l-th feature block, we perform the inverse shearlet transform on the shearlet coefficients of the first l levels (\(H_1, H_2, \dots, H_l\)) and the scaling coefficients L, as shown in (3).

$$ f_l^c = ISH\left(H_1^{\prime c}, H_2^{\prime c}, \dots, H_l^{\prime c}, L^c\right) $$
(3)

where ISH denotes the inverse discrete shearlet transform and l is the decomposition level. Equation (3) creates M feature blocks, one for each decomposition level, and the l-th feature block \(f_l^c\) has the same size as \(H_l^{\prime c}\).
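The construction of the M feature blocks in (3) can be sketched as follows. Again, `shearlet_reconstruct_3d` is a hypothetical placeholder for ISH(·) provided by the 3-D shearlet toolbox [21], and `build_feature_blocks` is an assumed helper name.

```python
def build_feature_blocks(H_list, L, shearlet_reconstruct_3d):
    """Build one feature block per decomposition level, Eq. (3).

    H_list: list [H'_1, ..., H'_M] of de-noised shearlet coefficient matrices.
    L: scaling coefficient matrix.
    shearlet_reconstruct_3d: callable implementing the inverse 3-D DShT (ISH).
    """
    feature_blocks = []
    for l in range(1, len(H_list) + 1):
        # f_l = ISH(H'_1, ..., H'_l, L): use only the first l detail levels.
        f_l = shearlet_reconstruct_3d(H_list[:l], L)
        feature_blocks.append(f_l)
    return feature_blocks
```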

The third step is to calculate saliency values on each decomposition level and generate the saliency map for each video frame. In this paper, we detect video saliency by measuring the rarity of different regions with global contrast. When we detected image saliency in our former paper [1], saliency was measured in two aspects, global and local. In this paper, we regard the input video as three-dimensional data. If we used local contrast to measure video saliency, all the distinctiveness would be magnified, including distinctiveness that exists in a single frame's background or is caused by background motion, so non-salient background would be labeled with large saliency values. That is the reason why we only use global contrast to detect video saliency in this paper.

As we know, video saliency is evaluated by measuring contrast based on features such as color, luminance, texture and motion. Figure 2 shows some examples. It can be seen that in Fig. 2a region P stands out due to color information, in Fig. 2b due to luminance information, in Fig. 2c due to direction information, and in Fig. 2d due to shape information. No matter which feature works, it can be concluded that rarity draws attention. Taking Fig. 2a as an example, the reason P stands out from its surroundings is that the average color of region P is green while that of the other regions is red. That is to say, in this example, the smaller area covered by green draws more attention than the larger area covered by red. Of course, this conclusion is valid if and only if there is no personal preference. Different from the former four examples, Fig. 2e is more complicated. Neither the darker regions on the left nor the brighter regions on the right, but the regions where light and dark alternate, are salient. That is to say, we cannot simply use luminance to make region P stand out. Explained in terms of rarity, only the alternating region covers both light and dark areas (see the region labeled by the red box), so region P is salient. In other words, the rarity of the alternating region attracts our attention.

Fig. 2 Examples of visual saliency

The main difference between video and image is the inter-frame motion information. In the temporal domain, whether a region stands out from its surroundings depends on whether it exhibits different motion. If there is only one salient object in the input video, there are mainly two kinds of motion: the global motion caused by the background (whether the background is moving or not) and the motion caused by the salient object. In general, the region covered by a moving salient object is usually much smaller than the remainder. This conclusion is still valid if there are more salient objects.

Before calculating the saliency values, we need to describe every location. Using all the feature blocks, every location (x, y, t) on the l-th level can be represented by a feature vector \(f_l(x, y, t)\) as:

$$ {f}_l\left(x,y,t\right)={\left[{f}_l^L\left(x,y,t\right),{f}_l^a\left(x,y,t\right),{f}_l^b\left(x,y,t\right)\right]}^T $$
(4)

Then global contrast is used to generate saliency maps for the video frames on each decomposition level. As mentioned above, the saliency of the t-th video frame \(vf_t\) is affected not only by the frame itself but also by a number of neighboring frames. For simplicity, every h successive frames are processed as a whole (h = 4 in this work), and we set \(\varepsilon = (vf_t, vf_{t+1}, \dots, vf_{t+h-1})\). For every location (x, y, t) ∈ ε, the likelihood of the features is defined as the probability density of a normal distribution:

$$ {p}_l\left(x,y,t\right)=\frac{1}{{\left(2\pi \right)}^{N/2}{\left|\varSigma \right|}^{1/2}}\times {e}^{-\frac{1}{2}{\left({f}_l\left(x,y,t\right)-\mu \right)}^T{\varSigma}^{-1}\left({f}_l\left(x,y,t\right)-\mu \right)} $$
(5)

where \(\varSigma = E\left[\left(f_l(x,y,t)-\mu\right)\left(f_l(x,y,t)-\mu\right)^T\right]\) is the covariance matrix, μ is the expectation vector, N is the dimension of the feature vector, and \(f_{l,t}\) denotes the feature map on the l-th level for the t-th video frame. The saliency value of location (x, y, t) on the l-th level is defined as:

$$ {S}_l\left(x,y,t\right)=G\left(-{ \log}_{10}{p}_l\left(x,y,t\right)\right)\ast {I}_{k\ast k} $$
(6)

where \(I_{k \ast k}\) represents a 2-D Gaussian low-pass filter (k = 5 in this work), which is employed to obtain a smoother result, and G(·) converts its argument to a grayscale image. It is worth mentioning that some feature blocks may be zero matrices, and the determinant of their covariance matrix is then zero, which would make the probability density defined in (5) invalid. Thus, only the non-zero feature blocks are used to generate feature vectors.
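The per-level saliency computation of (4)-(6) can be sketched in NumPy as below. The sketch assumes the three channel feature blocks of one h-frame window have been stacked into a single feature array, estimates μ and Σ empirically over that window, and uses `cv2.GaussianBlur` as the k × k Gaussian low-pass filter. The min-max scaling standing in for G(·), the small regularization term (used here instead of explicitly discarding all-zero feature blocks) and the function name `level_saliency` are assumptions.

```python
import cv2
import numpy as np

def level_saliency(features, k=5, eps=1e-12):
    """Compute S_l for one h-frame window of one decomposition level.

    features: array of shape (h, m, m, 3) holding the L, a, b feature values
              f_l(x, y, t) for every location of the window (Eq. (4)).
    Returns an array of shape (h, m, m) of smoothed saliency maps (Eq. (6)).
    """
    h, m, _, c = features.shape
    X = features.reshape(-1, c)                      # one feature vector per location
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + eps * np.eye(c)  # regularize near-singular covariance
    inv_cov = np.linalg.inv(cov)

    diff = X - mu
    mahal = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    log_p = -0.5 * (c * np.log(2 * np.pi) + np.log(np.linalg.det(cov)) + mahal)
    rarity = -log_p / np.log(10)                     # -log10 p_l(x, y, t), Eq. (6)

    S = rarity.reshape(h, m, m)
    # G(.): rescale to an 8-bit grayscale range, then smooth with a k x k Gaussian.
    S = (S - S.min()) / (S.max() - S.min() + eps) * 255.0
    return np.stack([cv2.GaussianBlur(S[t], (k, k), 0) for t in range(h)], axis=0)
```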

As the size of \(S_l\) obtained by (6) varies between levels, the saliency maps of the coarser levels are interpolated to the same spatial size as those of the first level, and their number is interpolated to match as well. For the t-th video frame, by fusing the saliency maps of all levels together, we obtain:

$$ S(t)=\sum_{l=1}^{M} N\left({S}_l(t)\right) $$
(7)

where N(⋅) is the normalization operator [13], and \(S_l(t)\) is the saliency map on the l-th level for the t-th video frame.
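A sketch of the cross-level fusion in (7): each level's saliency maps are first interpolated to the spatial size and frame count of the first level, then normalized and summed. The simple min-max scaling used here is only a stand-in for the normalization operator N(·) of [13], and `fuse_levels` is an assumed helper name.

```python
import numpy as np
from scipy.ndimage import zoom

def fuse_levels(S_levels, eps=1e-12):
    """Fuse per-level saliency blocks into S(t), Eq. (7).

    S_levels: list of arrays; S_levels[0] is the finest (first) level,
              each array has shape (frames_l, m_l, m_l).
    Returns an array with the shape of the first level.
    """
    target_shape = S_levels[0].shape
    fused = np.zeros(target_shape, dtype=np.float64)
    for S_l in S_levels:
        # Interpolate coarser levels up to the first level's temporal/spatial size.
        factors = [t / s for t, s in zip(target_shape, S_l.shape)]
        S_up = zoom(S_l, factors, order=1)
        # Stand-in for the normalization operator N(.) of [13]: min-max scaling.
        S_up = (S_up - S_up.min()) / (S_up.max() - S_up.min() + eps)
        fused += S_up
    return fused
```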

Goferman et al. [9] pointed out that locations around a focus-of-attention point should be given larger saliency values than those far away from it. Here, the locations with saliency values larger than 0.8, as in [9], are marked as focus-of-attention points. The final saliency map is calculated as

$$ {S}_0\left(x,y,t\right)=S\left(x,y,t\right)\times \left(1-d\left(x,y,t\right)\right) $$
(8)

where \( d\left(x,y,t\right)={d}_0\left(x,y,t\right)/\sqrt{{a_0}^2+{b_0}^2} \), \(d_0(x,y,t)\) is the distance between location (x, y) and its nearest focus-of-attention point \((x_0, y_0)\) in frame t, and \(a_0 \times b_0\) is the size of S(t).
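The focus-of-attention re-weighting of (8) can be sketched as follows, with `scipy.ndimage.distance_transform_edt` computing, for every pixel, the distance \(d_0\) to its nearest focus point. The sketch assumes the saliency map has been normalized to [0, 1] so that the 0.8 threshold from [9] applies; the helper name and the handling of frames without focus points are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def attention_weighting(S_t, threshold=0.8):
    """Apply the focus-of-attention weighting of Eq. (8) to one frame's map S(t).

    S_t: 2-D saliency map, assumed normalized to [0, 1].
    """
    a0, b0 = S_t.shape
    focus = S_t > threshold
    if not focus.any():
        return S_t                        # no focus points: leave the map unchanged
    # Distance from every pixel to its nearest focus-of-attention point.
    d0 = distance_transform_edt(~focus)
    d = d0 / np.sqrt(a0 ** 2 + b0 ** 2)
    return S_t * (1.0 - d)
```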

To show how the proposed model can be applied in practice, and to allow the experiments to be reproduced for validation purposes, the overall flowchart and detailed pseudo-code are given below (Fig. 3).

Fig. 3 Flowchart of the proposed video saliency detection model
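As a complement to the flowchart in Fig. 3, the following outline strings the previous sketches together into a per-block pipeline. It is only a minimal sketch of the steps described above, not the authors' implementation: `shearlet_decompose_3d` and `shearlet_reconstruct_3d` remain hypothetical placeholders for the 3-D DShT toolbox calls, and the other helpers are the ones sketched in the preceding sections.

```python
import numpy as np

def saliency_for_block(block, M=3, h=4):
    """Compute saliency maps for one n-frame Lab video block (sketch)."""
    # 1. Decompose each channel once, de-noise, and build M feature blocks (Eqs. (1)-(3)).
    per_channel = []
    for c in range(3):                                   # L, a, b channels
        H_list, L = shearlet_decompose_3d(block[..., c], levels=M)
        H_list = [shrink_coefficients(H) for H in H_list]
        per_channel.append(build_feature_blocks(H_list, L, shearlet_reconstruct_3d))

    # 2. Per-level saliency on h-frame windows (Eqs. (4)-(6)).
    level_maps = []
    for l in range(M):
        features = np.stack([per_channel[c][l] for c in range(3)], axis=-1)
        windows = [level_saliency(features[t:t + h])
                   for t in range(0, features.shape[0], h)]
        level_maps.append(np.concatenate(windows, axis=0))

    # 3. Fuse levels (Eq. (7)) and apply focus-of-attention weighting (Eq. (8)).
    fused = fuse_levels(level_maps)
    return np.stack([attention_weighting(f / (f.max() + 1e-12)) for f in fused], axis=0)
```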

3 Experiments and evaluations

In this section, we evaluate the performance of the proposed model on the database from the website of Akisato [8]. This database includes ten videos, named AN119T, BR128T, BR130T, DO01_013, DO01_014, DO01_030, DO01_055, DO02_001, M07_058 and VWC102T. For simplicity, V1-V10 are used to denote these ten videos. V1-V10 cover situations of different complexity. For example, V4 and V9 have very pure backgrounds, both clear blue sky. The backgrounds of V5, V7 and V10 are less pure, but their global motion can be easily compensated. Part of V8's background moves violently, which may confuse the detection of temporal saliency in a spatiotemporal framework. V2 and V3 have the most complex moving backgrounds, which are difficult for saliency detection [20, 25]. Each video includes about 100 frames and their corresponding ground-truths (see Fig. 4b). Instead of using the original ground-truths, we first convert them to binary maps (see Fig. 4c): pixels of salient regions are set to 1 and pixels of non-salient regions to 0. Figure 4d shows some saliency detection results of [20]. Comparing them with the results of the proposed model (see Fig. 4e), we find that although the proposed model assigns lower saliency values on most of the videos, it gives cleaner saliency maps since it is more robust to noise. In other words, far fewer non-salient regions are marked as salient than with the model of [20].

Fig. 4 Examples of different approaches

Itti et al. proposed a bottom-up model in 1998 [14]. It was the first model to completely implement and verify the Koch and Ullman model [16], and it is the most classical one, with great influence on the whole visual saliency detection area. By merging motion and flicker feature channels into Itti's model, Harel further presented an implementation of video saliency detection [12]. We compare the proposed model with Harel's work, referred to as HAREL. Furthermore, Duncan et al.'s model RE in [6], Zhou et al.'s model TM in [26] and Hadizadeh et al.'s model SAVC in [11] are also included in the comparison. These models cover several categories of video saliency detection mechanisms. RE regards video as three-dimensional data and introduces information entropy theory. TM and SAVC are state-of-the-art models; both are based on the spatiotemporal framework, while SAVC estimates saliency in the DCT domain based on the Itti-Koch-Niebur saliency model [8]. Some of the saliency detection results are shown in Figs. 5 and 6, where (a) shows the original video frames, (b) the results of HAREL, (c) the results of TM, (d) the results of SAVC, (e) the results of RE, and (f) the results of the proposed model. For HAREL, TM, SAVC and the proposed model, we show the saliency maps of the 10th, 20th, 30th, 40th, 50th and 60th frames. As each saliency map generated by [6] corresponds to m successive frames (m = 5 in [6]), we take its 2nd, 4th, 6th, 8th, 10th and 12th saliency maps for comparison. Taking the 4th saliency map as an example, it is generated from the 16th to 20th video frames, so it can be taken as the saliency map of these 5 frames when comparing with the other models.

Fig. 5 Examples of different approaches

Fig. 6 Examples of different approaches

Comparing the saliency results of the different models shown in Figs. 5 and 6, it is obvious that the proposed model outperforms the other four models, especially with complex backgrounds (see BR128T and BR130T). From Figs. 5 and 6, we can see that HAREL is more sensitive to noise and produces more irrelevant salient regions than the proposed model (see DO01_055 and VWC102T). Besides, HAREL detects saliency based on a center-surround mechanism, so it misses the regions inside salient objects, or gives lower saliency values to those inner pixels (see AN119T and DO01_014). SAVC and TM perform better at detecting saliency inside salient objects, but these two models are even more sensitive to noise than HAREL, and they mark many more non-salient regions as salient than the proposed model. For example, the background of DO01_055, DO01_030 and M07_058 is clean sky, which is obviously non-salient, but SAVC and TM still mark large salient regions in these areas. RE is more robust to noise than the other three models, but it is built on local information entropy, so it misses salient regions inside salient objects, as HAREL does. Besides, RE tends to mark non-salient regions around salient objects as salient, so it generates more irrelevant salient regions than the proposed model.

Besides comparing the different models directly on the saliency maps, we further evaluate the performance of the proposed model with the Precision (P for short), Recall (R for short) and \(F_a\) measure (\(F_a\) for short) values computed against the ground-truths. The definitions of P, R and \(F_a\) are as follows:

$$ P=\frac{\sum_{i=1}^{N}{P}_i}{N}, \qquad {P}_i=\frac{sum\left(gt\ast s\right)}{sum(s)} $$
(9)
$$ R=\frac{\sum_{i=1}^{N}{R}_i}{N}, \qquad {R}_i=\frac{sum\left(gt\ast s\right)}{sum(gt)} $$
(10)
$$ {F}_a=\frac{\left(1+a\right)\times P\times R}{a\times P+R} $$
(11)

where gt is the binary ground-truth map, s is the saliency map, N is the number of frames, and a is chosen to be 0.3, as in most saliency detection models. Since some of the video frames in the database have no ground-truth, we only use the video frames with ground-truths to compute the P, R and \(F_a\) values (starting from the 16th video frame).
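The evaluation metrics of (9)-(11) translate directly to NumPy. In the sketch below, `precision_recall_f` is an assumed helper name; s may be a binary map (as used for Fig. 7) or a gray-scale saliency map, since the formulas apply elementwise either way.

```python
import numpy as np

def precision_recall_f(saliency_maps, ground_truths, a=0.3, eps=1e-12):
    """Average P, R and F_a over N frames, Eqs. (9)-(11).

    saliency_maps, ground_truths: lists of arrays of equal shape
    (ground truths are binary 0/1 maps).
    """
    P_list, R_list = [], []
    for s, gt in zip(saliency_maps, ground_truths):
        inter = np.sum(gt * s)
        P_list.append(inter / (np.sum(s) + eps))   # Eq. (9)
        R_list.append(inter / (np.sum(gt) + eps))  # Eq. (10)
    P, R = np.mean(P_list), np.mean(R_list)
    F_a = (1 + a) * P * R / (a * P + R + eps)      # Eq. (11)
    return P, R, F_a
```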

The overall P, R and \(F_a\) results are shown in Fig. 7, where (a) shows the P, R and \(F_a\) values obtained by using the mean value as the threshold to generate binary maps, and (b) shows those obtained by using the Otsu function to generate binary maps. From Fig. 7, we can see that the overall performance of the proposed model is better than HAREL, SAVC and RE. Comparing the proposed model with TM, when the mean value is used as the threshold, the proposed model performs better than TM; when the Otsu function is used to obtain the threshold, the proposed model has higher P and \(F_a\) values but a lower R value than TM. One main reason is that TM generates more irrelevant salient regions than the proposed model, which yields higher R values with lower P values. Another reason is that, with the Otsu threshold, the binary maps of the proposed model lose some salient regions, which yields higher P values with lower R values.

Fig. 7 Performance comparison with P, R and \(F_a\) values

Figure 8 shows binary maps obtained with different thresholds on the saliency maps of the proposed model, where (a) shows the input video frames, (b) the saliency maps of the proposed model, (c) the ground-truths, and (d)-(k) the binary maps for the thresholds T = 8k (k = 1, 2, …, 8). From Fig. 8, we can see that the regions covered by positive values become smaller as the threshold becomes larger. That is to say, the way the threshold is set influences the performance evaluation results whenever the evaluation method is built on binary maps, and this is why Fig. 7a and b give different comparison results.

Fig. 8 Binary maps obtained by using different thresholds of the proposed model

For a better comparison, we use different thresholds (0-255) to obtain binary maps. After calculating the P, R and \(F_a\) values, we draw the PR curves, which can be seen in Fig. 10a, while Fig. 10b shows the average P, R and \(F_a\) values. From Fig. 10a, we can see that the PR curve of the proposed model is closer to the (1,1) point, which means the proposed model outperforms the other four models. From Fig. 10b, we can see that the proposed model performs better than HAREL, SAVC and TM. Compared with RE, the proposed model has clearly higher P and \(F_a\) values, while its R value is slightly lower. The reason is that RE generates more irrelevant salient regions (see Fig. 9), which yields higher R values with lower P values. It can therefore be concluded that the proposed model outperforms RE (Fig. 10).
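The PR curves in Fig. 10a are obtained by sweeping an integer threshold over the 8-bit saliency range. A minimal sketch, reusing `precision_recall_f` from above and assuming 8-bit saliency maps, is:

```python
import numpy as np

def pr_curve(saliency_maps, ground_truths, a=0.3):
    """Compute (P, R, F_a) for every threshold in 0..255, as used for Fig. 10a."""
    curve = []
    for T in range(256):
        binary_maps = [(s >= T).astype(np.uint8) for s in saliency_maps]
        curve.append(precision_recall_f(binary_maps, ground_truths, a=a))
    return curve  # list of (P, R, F_a) tuples, one per threshold
```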

Fig. 9 Binary maps obtained by using different thresholds of RE

Fig. 10 PR curves and average P, R and \(F_a\) values for different saliency detection models

As the ten videos cover situations of different complexity, we further present the PR curves and average P, R and \(F_a\) values of the different saliency detection models on each video, which can be seen in Fig. 11. From the PR curves, we can see that the proposed model performs better on most of the videos, especially on BR128T, BR130T and DO01_014. From the average P, R and \(F_a\) values, we can see that the proposed model has higher P, R and \(F_a\) values than the other models on AN119T, BR128T, BR130T, DO01_014 and DO02_001, but lower P, R and \(F_a\) values on DO01_013. Besides, the proposed model has higher P and \(F_a\) values but a lower R value than the other models on DO01_030, DO01_055 and VWC102T; and on M07_058 it has lower P but higher R and \(F_a\) values than HAREL, and higher P and \(F_a\) but a lower R value than RE. According to Fig. 11, on one hand, the PR curves of the proposed model are closer to the (1,1) point on DO01_030, DO01_055, VWC102T and M07_058; on the other hand, the proposed model has higher \(F_a\) values on these four videos. It can be concluded that the proposed model performs better on DO01_030, DO01_055, VWC102T and M07_058.

Fig. 11 PR curves and average P, R and \(F_a\) values for different saliency detection models on different videos

From Fig. 11, we can also see that the proposed model performs worse on DO01_013. The reason is that the salient objects in DO01_013 cannot be well distinguished from the background in the L, a and b channels, whereas the proposed model is built on the color information of these channels. Improving the proposed model so that it performs better on videos like DO01_013 is part of our future work.

4 Conclusions

In this paper, a novel video saliency detection framework using the 3-D DShT is proposed. It begins by generating multi-scale feature blocks through the inverse shearlet transform, from which saliency blocks are calculated. The final detection result is a combination of the saliency maps of the different levels. Compared with the popular spatiotemporal mechanism, our framework treats the video as 3D information and relies on a single detection basis: distinctiveness. To the best of our knowledge, this work is the first attempt to detect video salient regions in the shearlet domain, and the experimental results demonstrate the performance of the proposed model.

The proposed framework is extendable. In the future, we will further explore how to improve its performance by combining texture, direction and other features. Also, the use of the 3-D DShT requires loading a number of successive video frames into memory for processing, which may limit the use of the proposed framework in some real-time applications.