Introduction

The highly evolved human visual system enables us to attend rapidly to the conspicuous locations within a scene. It is the attention mechanism that helps us locate these salient regions. The human visual system receives an enormous amount of information from the outside world at each moment, but the information conveyed to the higher levels of the brain is greatly reduced through visual processing in the fovea and the ganglion cells in the retina, the lateral geniculate nucleus, the primary visual cortex (V1) and so on. This mechanism can be adopted in computer vision tasks such as segmentation (Mishra et al. 2009), object recognition (Liu et al. 2011), visual tracking (Mahadevan and Vasconcelos 2009) and image compression (Itti 2004).

Attention mechanisms fall into two main branches: top-down models, which are task-driven, and bottom-up models, which are stimulus-driven. In practice, the two types of mechanism interact with each other. Top-down attention refers to visual perception biased by specific tasks or intentions. In bottom-up models, the attended regions are in general sufficiently distinct from their surroundings in terms of low-level features such as intensity, color, orientation or motion. Many existing models fall into this category (Achanta et al. 2009; Guo et al. 2008; Itti et al. 1998). Among bottom-up models, the saliency map (Koch and Ullman 1987), a topological map containing global conspicuity information, is frequently assumed and used because it directly indicates the attended locations or regions. This paper focuses on computational models that work in a purely bottom-up manner.

A typical biologically inspired model, the Neuromorphic Vision Toolkit (NVT), was proposed by Itti et al. (1998) and follows the Feature Integration Theory (Treisman and Gelade 1980). The model mainly subtracts filtered versions of the input at different scales to simulate the center-surround difference [Difference of Gaussians (DoG) filter], which represents the on–off and off–on effects of the receptive fields of ganglion cells. It also uses color opponency and Gabor filters of different orientations at multiple scales, simulating the processing of simple cells in the visual cortex, to extract visual features. After obtaining the across-scale differences, the model normalizes each feature map to emphasize the active locations, a step inspired mainly by cortical lateral inhibition mechanisms. In this way it computes saliency over multiple channels and multiple scales using the center-surround mechanism. The model has a biologically plausible structure in the spatial domain, and its saliency maps mostly coincide with human fixations for both psychological patterns and natural images.

However, because it uses center-surround differences to simulate the processing of retinal ganglion cells, much of the low-frequency information in a scene is discarded, so the model fails to extract the saliency of large regions that contain many low-frequency components. The normalization operation (inhibition mechanism) also makes high-frequency components stand out. As a result, the model essentially extracts salient points rather than coherent areas. Many other models are derived from this baseline (Harel et al. 2007; Le Meur et al. 2007; Walther and Koch 2006).

Another category of models (Guo et al. 2008; Hou and Zhang 2007; Bian and Zhang 2010; Yu et al. 2011; Li et al. 2013) is based on frequency-domain analysis and is computationally fast. Guo et al. propose the Phase spectrum of Quaternion Fourier Transform (PQFT) model, which pops out the edge information of objects, since phase information is related to the local properties (form and position) of the image (Oliva and Torralba 2001). Li et al. introduce the Hypercomplex Fourier Transform (HFT) model, in which multiple Gaussian filters of different scales are applied to the log-amplitude spectrum in order to highlight the salient information. There is no evidence that frequency-domain operations occur in the human brain, so such models are not consistent with the human visual system.

There are also application-oriented models designed for specific tasks such as large object segmentation and object recognition. In terms of frequency bands, a large object covers more low-frequency components. Achanta et al. therefore proposed the Frequency-Tuned Saliency (FTS) algorithm, which retains most of the frequency components of an image in order to achieve better segmentation. It computes saliency in the spatial domain by simple subtraction between the Gaussian-filtered image and its global mean, i.e. most frequency components are preserved except the direct current (DC) component. This model is computationally efficient and performs well. However, models of this kind (Cheng et al. 2011; Perazzi et al. 2012) are effective only for large object segmentation.

To our knowledge, most existing models focus on certain kinds of objects or adapt only to fixed situations. For instance, the NVT model can only extract salient points carrying much high-frequency information, while the engineering-based models are suited to large objects. The idea of this paper is to build a computational model that adjusts itself automatically according to the stimuli. We therefore propose a bottom-up model that is designed from the perspective of the frequency domain but operates in the spatial domain, based on the biological discovery of the non-Classical Receptive Field (nCRF) of retinal ganglion cells. Because the nCRF (Li et al. 1992) complements and interacts with the CRF according to the stimuli, the low-frequency loss caused by the center-surround mechanism can be largely compensated by tuning the nCRF areas outside the CRF. The concepts of nCRF and CRF may be related to the processing of the magnocellular cells and the parvocellular cells of the lateral geniculate nucleus (LGN) (Shi et al. 2011).

The main features of the proposed model are: (1) it improves Itti’s classical NVT model by employing the Gabor wavelet transform, taking multiple orientations and multiple scales (as in Itti’s model) into account while retaining the low-frequency components so that the frequency bandwidth can be tuned adaptively; (2) based on the discovery of the nCRF, it proposes a way to adaptively adjust the frequency band according to diverse stimuli and a method to select the optimal scale in the Gabor wavelet domain.

The rest of the paper is organized as follows. The “Biological background for the algorithm” section introduces the biological background of the proposed model. The “The proposed algorithm” section describes the algorithm in detail, including how to decompose frequency bands with the Gabor wavelet and how to whiten and select them. Experimental results and comparisons with other models are presented and discussed in the “Experimental results and discussions” section. Finally, conclusions are drawn in the “Conclusions and future work” section.

Biological background for the algorithm

As mentioned above, some spatial and spectral models fail in certain scenarios, especially when salient regions are relatively large, because of the center-surround processing they employ or their insufficient use of spectral information. Conversely, some engineering-oriented models perform remarkably well on larger objects, while small objects and psychological patterns are beyond their reach. Therefore, to learn how the human visual system really works, the receptive field models of retinal ganglion cells are examined here.

The Classical Receptive Field (CRF) is a center-surround antagonistic structure of retinal ganglion cells, and Rodieck et al. proposed the DoG function to describe it. In the frequency domain, the DoG structure is typically represented as a ring-shaped band, as shown in Fig. 1a (left). Many bottom-up models, such as NVT, adopt this structure in their saliency computations, but it is flawed because it contains only certain high-pass bands and leaves out the low-frequency components at the central region shown in Fig. 1a, even if several DoGs of different scales are used. This may result in highlighting only small salient regions or the edges of large objects while failing to capture the whole object.

Fig. 1 Left: frequency domain of DoG; right: frequency domain of ECRF, where the central part is the effective area of nCRF

Physiologists found in the 1960s, however, that the center-surround CRF can be influenced by a larger region outside the CRF (Ikeda and Wright 1972). This area is regarded as the non-Classical Receptive Field (nCRF); it can inhibit the antagonistic effect of the center-surround structure and compensate for the loss of low-frequency components. To explain the relation between CRF and nCRF, Ghosh et al. (2006) suggested the following equation, named ECRF, which uses three zero-mean Gaussians with different variances:

$$ECRF(\sigma_{1} ,\sigma_{2} ,\sigma_{3} ) = A_{CRF} \left( {\frac{1}{{2\pi \sigma_{1}^{2} }}e^{{ - \frac{{x^{2} + y^{2} }}{{2\sigma_{1}^{2} }}}} - \frac{1}{{2\pi \sigma_{2}^{2} }}e^{{ - \frac{{x^{2} + y^{2} }}{{2\sigma_{2}^{2} }}}} } \right) + A_{nCRF} \frac{1}{{2\pi \sigma_{3}^{2} }}e^{{ - \frac{{x^{2} + y^{2} }}{{2\sigma_{3}^{2} }}}}$$
(1)

where \(ECRF( \cdot )\) represents the response function; \(\sigma_{1}\), \(\sigma_{2}\) and \(\sigma_{3} \;(\sigma_{1} < \sigma_{2} < \sigma_{3} )\) are the standard deviations that set the region sizes of the center, the antagonistic surround and the extended non-inhibitory surround, respectively; and \(A_{CRF}\) and \(A_{nCRF}\) are the amplitudes of the two structures. The first two terms form the classical center-surround structure (DoG filter). The last term is the compensating extension, a Gaussian with a larger variance in the spatial domain.

Figure 1a shows the frequency band of the classical DoG filter (the first two terms in Eq. (1)) with \(\sigma_{2} = 3\sigma_{1}\) in the 1D (right) and 2D (left) cases. Figure 1b shows the frequency band of the ECRF structure with \(\sigma_{2} = 3\sigma_{1}\) and \(\sigma_{3} = 3\sigma_{2}\). Note that in Fig. 1b both structures have the same amplitude, i.e. \(A_{CRF} = A_{nCRF}\). In Fig. 1b, the central part of the frequency domain, which contains much of the low-frequency information, is well preserved. The DC component (the origin in Fig. 1b) is not needed and is removed in subsequent processing. Although the structure in Fig. 1a also excludes the DC component, it fails to retain the low-frequency information around it.
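The difference between the two frequency bands can also be checked numerically. The sketch below relies on the standard result that a unit-area 2D Gaussian with standard deviation \(\sigma\) has the Fourier transform \(e^{-2\pi^{2}\sigma^{2}\rho^{2}}\) at radial frequency \(\rho\). The function names and the choice of \(\rho\) in cycles per pixel are illustrative assumptions, not part of the original model code.

```python
import math

def gaussian_spectrum(rho, sigma):
    """Fourier transform of a unit-area 2D Gaussian at radial frequency rho."""
    return math.exp(-2.0 * (math.pi * sigma * rho) ** 2)

def ecrf_spectrum(rho, sigma1, a_crf=1.0, a_ncrf=1.0):
    """Spectrum of Eq. (1) with sigma2 = 3*sigma1 and sigma3 = 3*sigma2."""
    sigma2 = 3.0 * sigma1
    sigma3 = 3.0 * sigma2
    # Classical center-surround part (DoG): vanishes at rho = 0.
    dog = gaussian_spectrum(rho, sigma1) - gaussian_spectrum(rho, sigma2)
    # Extended surround (nCRF): a narrow low-pass bump around the origin.
    return a_crf * dog + a_ncrf * gaussian_spectrum(rho, sigma3)

if __name__ == "__main__":
    for rho in (0.0, 0.02, 0.1, 0.3):
        dog_only = ecrf_spectrum(rho, 1.0, a_ncrf=0.0)
        full = ecrf_spectrum(rho, 1.0)
        print(f"rho={rho:.2f}  DoG={dog_only:.4f}  ECRF={full:.4f}")
```

Setting `a_ncrf=0.0` reproduces the pure DoG band, whose response falls to zero towards the origin, whereas the full ECRF keeps a substantial response at low frequencies.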

Figure 1 illustrates that the nCRF adjusts the frequency bandwidth of the DoG filter and thus compensates for the loss of low-frequency information. This is why human beings can easily attend to salient objects of arbitrary size in a scene. In contrast, models that use center-surround filters to compute saliency do not include low-frequency components. A model like PQFT, which flattens the amplitude spectrum, merely strengthens high-frequency components and brings out the edges of objects. A model like FTS always extracts the same frequency range no matter what the stimulus is, whereas the range of frequency bands should vary according to the stimuli.

The main purpose of our saliency computation is to extract certain frequency bands of the input image according to the input stimuli, which might be consistent with the idea of coarse-scale and fine-scale processing in the M (magnocellular) and P (parvocellular) pathways, respectively (Shi et al. 2011). To achieve this, the whole frequency domain of the input image is decomposed into several discrete bands. From these bands, the optimal ones containing meaningful saliency information are selected to build the final saliency map. Decomposition and selection make it convenient to take into account both the multi-scale center-surround subtraction employed by Itti’s model and the retention of low-frequency information, which Itti’s model does not consider.

The wavelet transform is used here to divide the frequency domain into bands. The discrete wavelet transform, which takes both spatial and spectral information into consideration, performs a logarithmic division of the frequency domain. This is more practical than the FFT because low-frequency components always have low spatial resolution and thus need a finer division in the frequency domain. A fine division of the low-frequency components improves the saliency computation. In addition, the wavelet transform is multi-scale, with each scale representing a different band in the frequency domain, and it can split each frequency band (except the low-frequency part) into different orientations, as simple cells do in the primary visual cortex. This also facilitates the calculation of saliency. The orientations of the wavelet transform in each frequency band correspond to different sub-bands. These sub-bands are whitened across channels to highlight the saliency information, as are the low-frequency ones. After that, the optimal bands among the high-frequency and low-frequency ones can be selected. These operations partly simulate the mechanism of frequency bandwidth adjustment and achieve the same effect as frequency band selection.

The proposed algorithm

The basic steps of the proposed algorithm are band division, whitening and band selection; each step is shown in the diagram in Fig. 2. Before processing, the original color image is converted to the Lab color space to form three channel images. For each channel image, the Gabor wavelet is employed to decompose the channel into feature maps corresponding to different frequency bands. After whitening and fusion, one or two local saliency maps corresponding to certain frequency bands are selected to generate the final saliency map.

Fig. 2 Diagram of the algorithm

Gabor decomposition

Among the different families of wavelet functions, the Gabor function is adopted here for the decomposition because it resembles the processing of simple cells in the primary visual cortex; Itti’s NVT model also uses it to analyze orientation information. In addition, the low-frequency components of the Gabor wavelet domain are retained. The Gabor filters and their formula are shown in Fig. 3 and Eq. (2), respectively.

Fig. 3 Five 2D Gabor filters (top row) with their corresponding amplitude spectra (bottom row). From left to right: low-frequency part, high-frequency part with \(0^\circ ,\;45^\circ ,\;90^\circ ,\;135^\circ\) orientations

Therefore, 2D Gabor filters rather than the fast wavelet decomposition are employed to carry out the wavelet decomposition, in order to obtain more orientation information. The 2D Gabor function is:

$$g(x,y;\lambda ,\theta ,\sigma ) = \exp \left( { - \frac{{x^{{{\prime }2}} + y^{{{\prime }2}} }}{{2\sigma^{2} }}} \right)\cos \left( {2\pi \frac{{x^{{\prime }} }}{\lambda }} \right)$$
(2)

where \(x^{{\prime }} = x\cos \theta + y\sin \theta\), \(y^{{\prime }} = - x\sin \theta + y\cos \theta\), and \(\theta = \{ 0^{ \circ } ,\;45^{ \circ } ,\;90^{ \circ } ,\;135^{ \circ } \}\). Here \(\lambda\) is the wavelength and \(\sigma^{2}\) is the variance of the Gaussian envelope. Four band-pass filters and one low-pass filter (obtained when \(\lambda\) approaches infinity) together give five 2D Gabor filters. These five filters cover almost the whole frequency domain at each scale. They are shown in Fig. 3.

The diagram of Gabor decomposition with three scales in one channel is shown in Fig. 4 and the relationships between spatial feature maps (Fig. 4a) and spectral bands (Fig. 4b) are also illustrated.

Fig. 4 Illustration of Gabor decomposition and relationships between feature maps and bands. a High-frequency sub feature maps and low-frequency feature maps. b Corresponding frequency domain

Note that the feature maps \(f_{Bx}\) are half the size of \(f_{Ax}\), and \(f_{Cx}\) are in turn half the size of \(f_{Bx}\), for \(x = 1,\;2,\;3,\;4\), while feature maps at the same scale have equal sizes. Since these feature maps correspond to different frequency components, selections made on the maps are equivalent to selections made on the frequency components. This approach therefore calculates saliency from feature maps in the spatial domain. In the following sections, feature maps are used to represent frequency bands.

For Eq. (2), \(\theta = \{ 0^{ \circ } ,45^{ \circ } ,90^{ \circ } ,135^{ \circ } \}\) and the scale is \(\sigma = 7/5\). The low-pass 2D Gabor filter sets \(\lambda\) to a large number, \(\lambda_{low} = 2.5^{10}\), and the four high-frequency filters set \(\lambda_{high} = 2.5\). The filters are \(15 \times 15\) pixels in size (shown in Fig. 3). Experimental results indicate that the saliency map computation is insensitive to the parameters of the Gabor filters, as long as the five filters cover the whole frequency domain. More orientations can be included as well.
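As an illustration, the five filters can be generated directly from Eq. (2) with the parameter values quoted above. This is a minimal NumPy sketch: the function names `gabor_kernel` and `gabor_bank` are hypothetical, and no normalization is applied because the text does not specify one.

```python
import numpy as np

def gabor_kernel(wavelength, theta_deg, sigma=7.0 / 5.0, size=15):
    """Sample Eq. (2) on a size x size grid centered at the origin."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    theta = np.deg2rad(theta_deg)
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + y_rot ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * x_rot / wavelength)

def gabor_bank():
    """Return the low-pass kernel and the four oriented kernels."""
    # A very large wavelength makes the cosine term ~1, leaving a Gaussian.
    low = gabor_kernel(2.5 ** 10, 0.0)
    high = [gabor_kernel(2.5, t) for t in (0.0, 45.0, 90.0, 135.0)]
    return low, high

if __name__ == "__main__":
    low, high = gabor_bank()
    print(low.shape, len(high))
```

With \(\lambda_{low} = 2.5^{10}\) the cosine term stays within a tiny fraction of 1 over the \(15 \times 15\) support, so the low-pass kernel is effectively the Gaussian envelope itself.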

Concretely, the input channel image is filtered repeatedly with the Gabor wavelet filters to generate feature maps, and the low-frequency feature map is down-sampled each time until the height of the decomposed map is less than 32 pixels. In addition, if the height of the input image is greater than 256 pixels, the feature maps at the first scale are discarded, since they contain little significant information and most of it can be regarded as noise. After filtering, four high-frequency feature maps and one low-frequency feature map are obtained at each scale (scale 1 to scale N, where N is typically 3–5 depending on the original size of the input image), as shown in Fig. 2.
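The decomposition loop can be sketched as follows. Several details are assumptions made only for illustration: mirror padding at the borders, down-sampling by keeping every second pixel, and stopping as soon as the current map is lower than 32 pixels. The helper names `filter_same` and `decompose` are hypothetical, and the kernels are passed in by the caller (for example, the five Gabor filters described above).

```python
import numpy as np

def filter_same(image, kernel):
    """Correlate image with kernel, keeping the image size (mirror padding)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)),
                    mode="reflect")
    out = np.zeros_like(image, dtype=float)
    rows, cols = image.shape
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out

def decompose(channel, low_kernel, high_kernels,
              min_height=32, skip_first_above=256):
    """Return a list of scales, each with 'high' maps and one 'low' map."""
    current = np.asarray(channel, dtype=float)
    original_height = current.shape[0]
    scales = []
    while current.shape[0] >= min_height:
        high = [filter_same(current, k) for k in high_kernels]
        low = filter_same(current, low_kernel)
        scales.append({"high": high, "low": low})
        current = low[::2, ::2]  # halve the resolution for the next scale
    # For tall inputs the finest scale is dropped, as described in the text.
    if original_height > skip_first_above and len(scales) > 1:
        scales = scales[1:]
    return scales
```

Under these assumptions a channel that is 128 pixels high yields three scales (heights 128, 64 and 32), and a channel that is 512 pixels high yields four scales after the first is discarded, both within the typical range of N given above.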

Whitening and computation of local saliency maps

The processing of feature maps includes whitening which aims to extract saliency information and fusion which sums the whitened feature maps to generate the local saliency maps.

At each scale and across all channels, the high-frequency feature maps and the low-frequency ones are whitened separately using Zero-Phase Component Analysis (ZCA) whitening (http://ufldl.stanford.edu/wiki/index.php/Whitening), a method similar to Principal Component Analysis (PCA) whitening. Whitening decorrelates and orthogonalizes the feature maps. After whitening, the DC component is removed and the variance of each map is normalized. As a result, the unique part of the data is emphasized while the redundancy is suppressed. The idea of whitening is also employed in several other works (Garcia-Diaz et al. 2012; Bian and Zhang 2010). The squared whitened feature maps are then summed to obtain the local saliency map. There are two local saliency maps at each scale, corresponding to the low- and high-frequency bands respectively.

Whitening

ZCA-based signal whitening is operated on feature maps at the same scale across channels for high- or low-frequency bands separately, as shown in the dashed boxes of Fig. 2, so that the resulting feature maps become orthogonal and uncorrelated with each other. Let \({\mathbf{f}}_{i} ,\;i = 1 \ldots n\) be the vectorized feature map for a given scale. \({\mathbf{W}} \in {\mathbf{R}}^{n \times n}\) is a ZCA whitening matrix and the whitened result \({\mathbf{F}}^{{\prime }}\) is:

$${\mathbf{F}}^{{\prime }} {\mathbf{ = WF}},{\mathbf{F}} = [{\mathbf{f}}_{1} ,{\mathbf{f}}_{2} \ldots {\mathbf{f}}_{n} ]^{T} ,\;\;{\mathbf{f}}_{i} \in {\mathbf{R}}^{M \times 1}$$
(3)

where n is the number of feature maps at the same scale, \(n = 4 \times 3\) for high-frequency feature maps and \(n = 3\) for low-frequency feature maps, and M is the number of pixels in a feature map. Note that the whitening matrix differs between the high- and low-frequency feature maps and between scales. After whitening, each whitened feature map has unit variance and zero mean. Whitening not only removes the DC component, which is not needed for further processing, as mentioned above, but also highlights the saliency information.
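A compact sketch of Eq. (3) is given below, assuming the usual ZCA construction \({\mathbf{W}} = {\mathbf{U}}\,\mathrm{diag}(1/\sqrt{\lambda_{k} + \varepsilon})\,{\mathbf{U}}^{T}\) from the eigen-decomposition of the covariance matrix of the mean-removed maps. The small regularizer `eps` and the function name `zca_whiten` are assumptions introduced for this example.

```python
import numpy as np

def zca_whiten(F, eps=1e-8):
    """Whiten the rows of F (n maps x M pixels); return (F_white, W)."""
    F = np.asarray(F, dtype=float)
    # Remove the mean (DC component) of every vectorized feature map.
    centered = F - F.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / centered.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA: rotate to the eigenbasis, rescale, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return W @ centered, W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    maps = rng.normal(size=(3, 64 * 64))
    maps[1] += 0.8 * maps[0]  # introduce correlation between two maps
    white, _ = zca_whiten(maps)
    print(np.round(white @ white.T / white.shape[1], 3))
```

Each map is flattened to a row before whitening and reshaped to its original height and width afterwards; the covariance of the whitened rows is then close to the identity matrix.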

Figure 5 gives a target search example with a psychological pattern under conjunction conditions, i.e. a unique red \(90^\circ\) bar is inserted among red \(0^\circ\) bars and green \(90^\circ\) bars with orientation disturbance. Most visual attention models fail in this case, but ours does not. When the high-frequency feature maps at scale i are whitened, the unique bar pops out in one channel while the others are suppressed.

Fig. 5 Illustration of whitening

Computation of local saliency maps

For high-frequency whitened feature maps, the local saliency map is the simple quadratic sum of all whitened maps at each scale. The formula is:

$$S_{hi} = \sum\limits_{j = 1}^{3 \times K} {f_{hij}^{{{\prime }2}} }$$
(4)

where \(S_{hi}\) is the high-frequency local saliency map of the ith scale, K is the number of orientations and is set to 4 here, and \(f_{hij}^{{\prime }}\) is the jth whitened high-frequency feature map. Applying Eq. (4) gives N local saliency maps for the N scales in the high-frequency bands, as shown in Fig. 2, which correspond to N bands.

For low-frequency feature maps, the fusion strategy is somewhat different. After whitening, the whitened feature maps are combined with weights. The weight function decreases monotonically as the 2D entropy increases:

$$S_{li} = \sum\limits_{{j = \{ L,a,b\} }} {\exp ( - 0.5 \times (entropy_{2D} (f_{lij}^{\prime 2} ))^{2} ) \times f_{lij}^{\prime 2} }$$
(5)

where \(S_{li}\) is the low-frequency local saliency map of the ith scale, \(f_{lij}^{\prime}\) is the jth whitened low-frequency feature map, and \(entropy_{2D} ( \cdot )\) is the 2D entropy value calculated by Eq. (7) (detailed in the following section), with the number of quantization levels set to eight as a compromise between accuracy and computational cost.

The weights are introduced because large objects have low responses after whitening, owing to the unit variance. If the maps were summed with equal weights as in Eq. (4), the larger salient objects would be suppressed. Figure 6 shows that a large object with a clear structure has low 2D entropy. With this weight, the importance of feature maps containing large salient objects is not diminished. Thus, the saliency of large objects is popped out by whitening and preserved by the weighted summation. As mentioned above, there are also N local saliency maps for the low-frequency band.
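The two fusion rules, Eqs. (4) and (5), can be written compactly as below. This is only a sketch: the 2D entropy values of the squared maps are assumed to be computed beforehand and passed in by the caller, and the names `fuse_high` and `fuse_low` are hypothetical.

```python
import numpy as np

def fuse_high(whitened_maps):
    """Eq. (4): sum of squared whitened high-frequency maps at one scale."""
    maps = np.asarray(whitened_maps, dtype=float)
    return (maps ** 2).sum(axis=0)

def fuse_low(whitened_maps, entropies):
    """Eq. (5): entropy-weighted sum of squared whitened low-frequency maps.

    entropies[j] is the 2D entropy of the j-th squared map (L, a, b).
    """
    maps = np.asarray(whitened_maps, dtype=float)
    weights = np.exp(-0.5 * np.asarray(entropies, dtype=float) ** 2)
    # Contract the weight vector with the stack of squared maps.
    return np.tensordot(weights, maps ** 2, axes=1)
```

A squared map with entropy 0 keeps its full weight of 1, while an entropy of 2 reduces the weight to \(e^{-2} \approx 0.14\), so cluttered channels contribute far less to the low-frequency local saliency map.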

Fig. 6 Illustration of selection principle

Selection via importance measure

A set of local saliency maps covering different frequency bands is thus obtained. Among them, the one or two maps (bands) with the most significant saliency information are picked out. An importance measure is therefore defined to carry out the selection. It combines two metrics: one is the maximum response of the map and the other measures its degree of clutter. The 2D entropy (Abutaleb 1989; Yang et al. 1996) is used in this paper to estimate the degree of clutter of a map.

To calculate the 2D entropy, a 2D gray-level histogram that takes spatial relations into account is first formed by comparing the original image \(f(x,y)\) with its average-filtered version \(g(x,y) = m * f(x,y)\), where m is a 2D mean filter of \(3 \times 3\) pixels. The 2D histogram is an \(L \times L\) square matrix, where L is the number of gray levels. A pixel located at \((x,y)\) that has the ith gray level in \(f(x,y)\) and the jth gray level in \(g(x,y)\) contributes one count to \(r_{ij}\), where \(r_{ij}\) denotes the number of pixels that are at the ith gray level of \(f(x,y)\) and at the jth level of \(g(x,y)\). After all pixels are scanned, the element \(p_{ij}\) of the 2D histogram is calculated as follows:

$$p_{ij} = \frac{{r_{ij} }}{PNUM}$$
(6)

where PNUM represents total pixel number of a map. Then the 2D entropy of a map can be calculated based on the generated 2D histogram:

$$entropy_{2D} = - \sum\limits_{i = 1}^{L} {\sum\limits_{j = 1}^{L} {p_{ij} \log p_{ij} } }$$
(7)

where \(p_{ij}\) is calculated according to Eq. (6).

By definition, the 2D histogram mainly reflects edge changes, since uniform regions scarcely alter their gray level after average filtering. If a map is topologically compact, which means it has little edge information, the averaged map also contains relatively little edge information. On the other hand, when a scene is cluttered, smoothing may cause large gray-level changes, which leads to a relatively large 2D entropy value. So, the smaller the 2D entropy value is, the more significant the map is.
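Equations (6) and (7) can be sketched as follows. The quantization scheme (min-max scaling into `levels` equal bins), the edge-replicating border handling for the \(3 \times 3\) mean filter and the natural logarithm are assumptions made for this example, since the text does not fix them; `entropy_2d` is a hypothetical name.

```python
import numpy as np

def entropy_2d(image, levels=8):
    """2D entropy of a map from the joint histogram of f and its 3x3 mean g."""
    f = np.asarray(image, dtype=float)
    rows, cols = f.shape
    padded = np.pad(f, 1, mode="edge")
    # 3x3 mean filter built from the nine shifted copies of the padded map.
    g = sum(padded[dy:dy + rows, dx:dx + cols]
            for dy in range(3) for dx in range(3)) / 9.0
    lo, hi = f.min(), f.max()
    span = hi - lo if hi > lo else 1.0
    # Quantize both maps to integer gray levels 0 .. levels-1.
    qi = np.minimum(((f - lo) / span * levels).astype(int), levels - 1)
    qj = np.minimum(((g - lo) / span * levels).astype(int), levels - 1)
    hist = np.zeros((levels, levels))
    np.add.at(hist, (qi.ravel(), qj.ravel()), 1)  # r_ij counts
    p = hist / f.size                             # Eq. (6)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())  # Eq. (7)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blob = np.zeros((64, 64))
    blob[16:48, 16:48] = 1.0
    print(entropy_2d(blob), entropy_2d(rng.random((64, 64))))
```

In this setting a compact square on a uniform background gives a much smaller value than a noise image of the same size, which matches the interpretation of 2D entropy as a measure of clutter.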

Our experiments show that the 2D entropy value and the maximum response of a map can be used together as the criterion for selecting bands. The 2D entropy measures the degree of clutter of an image. The low-frequency part of a scene containing large objects with compact structures usually has a lower 2D entropy value, so we tend to choose low-frequency feature maps in this case. An image with a single small object, in contrast, tends to have a higher response on the high-frequency side, so we are inclined to select high-frequency components in that scenario. Taking both factors into account, we set a criterion that favors maps with higher responses and smaller entropy values. Two examples are shown in Fig. 6, where the low-frequency feature map in the left column has a smaller 2D entropy value and the high-frequency map in the right column has a higher response.

In other words, the optimal local saliency maps are those that have both a stronger response and a smaller 2D entropy value, the latter indicating less clutter. These operations are meant to adjust the bandwidth so as to extract saliency information more effectively. The selection of bands is based on the importance measure of each local saliency map, defined as follows:

$$IM = \hbox{max} (map) \times \exp ( - 0.5 \times (entropy_{2D} (map))^{2} )$$
(8)

where \(IM\) is the importance of each map, \(\hbox{max} ( \cdot )\) is the function used to measure the maximum response of a map, \(entropy_{2D} ( \cdot )\) is the 2D entropy value. And map represents the local saliency map indicated in Fig. 2.

Suppose the original image is decomposed into N scales. There are \(2 \times N\) candidate local saliency maps to be selected from, where N maps are related to low-frequency part while the other N are related to high-frequency bands. One or two local saliency maps with no overlapped bands are picked out as optimal maps, either high-frequency or low-frequency, or even both. The selection of optimal local saliency maps depends on their importance values which are calculated by Eq. (8).

First, the local saliency map with the largest importance value \(IM_{\hbox{max} }\) is picked out. If no other map has an importance value larger than \(0.5 \times IM_{\hbox{max} }\), this local saliency map is the final saliency map. If other maps do have importance values larger than \(0.5 \times IM_{\hbox{max} }\), the one with the second largest importance value is also selected. In this case, the final saliency map is the equal-weight combination of the two maps, provided that their bands do not overlap. If the band of the map with the second largest importance overlaps with that of the largest one, the map with the third or fourth largest importance is considered instead.
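The selection rule can be sketched in a few lines. How bands are represented is not specified in the text, so the example assumes, purely for illustration, that each candidate carries a normalized frequency interval `(f_low, f_high)` and that two bands overlap when these intervals intersect; the dictionary keys and function names are hypothetical.

```python
import math

def importance(max_response, entropy):
    """Eq. (8): maximum response damped by the 2D entropy of the map."""
    return max_response * math.exp(-0.5 * entropy ** 2)

def overlaps(band_a, band_b):
    """True if two (f_low, f_high) intervals intersect."""
    return band_a[0] < band_b[1] and band_b[0] < band_a[1]

def select_maps(candidates, ratio=0.5):
    """Return the one or two selected candidates, most important first."""
    ranked = sorted(candidates, key=lambda c: c["im"], reverse=True)
    best = ranked[0]
    for other in ranked[1:]:
        if other["im"] <= ratio * best["im"]:
            break  # the remaining maps are even less important
        if not overlaps(best["band"], other["band"]):
            return [best, other]
    return [best]

if __name__ == "__main__":
    candidates = [
        {"name": "low, scale 2", "im": 1.0, "band": (0.0, 0.125)},
        {"name": "low, scale 1", "im": 0.9, "band": (0.0, 0.25)},
        {"name": "high, scale 1", "im": 0.6, "band": (0.25, 0.5)},
    ]
    print([c["name"] for c in select_maps(candidates)])
```

In this example the second-ranked map is skipped because its band contains that of the best map, and the third-ranked map is selected instead; the final saliency map would then be the equal-weight combination of the two selected maps after resizing them to a common resolution.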

Experimental results and discussions

To make a comprehensive evaluation on the proposed model, the testing databases include both natural images and psychological patterns/images. Natural images contain not only small sized salient objects but also large ones.

In order to better illustrate the superiority of the proposed model, the comparisons are performed between our model and several state-of-the-art models, including the typical model in space domain, NVT (Itti et al. 1998), the representative spectral model using FFT, PQFT (Guo et al. 2008), the large object segmentation oriented model on engineering, FTS (Achanta et al. 2009) and the model adaptive to various kinds of stimuli, HFT (Li et al. 2013).

Among these models, PQFT resizes the input to \(64 \times 64\) and HFT resizes it to \(128 \times 128\) as their optimal defaults, while the others, like our approach, do not adjust the resolution. Resizing the input may speed up the computation but can lead to irreversible information loss. This will be discussed later.

Quantitative and subjective evaluation

For psychological patterns, the saliency results of several common cases are listed to make a subjective evaluation on each model. For natural data, the outputs of each model are compared with the ground truth in a quantitative manner.

The ground truth data are based on human visual behavior and are of two main types: fixation maps and labeled maps. A fixation map is a record of human fixations within one image, obtained with an eye-tracking apparatus. Data of this kind are binary maps with logical 1s (fixation points) dotted over the image. Ground truth maps of the other kind are also binary maps, but the logical 1s form coherent areas labeled by a number of subjects. For the fixation datasets, sAUC (shuffled Area Under the ROC Curve; larger is better) is used to measure performance because it eliminates the influence of center bias (Tatler et al. 2005; Zhang et al. 2008), to which the other metrics are susceptible. For the segmentation datasets, precision/recall is adopted as the metric. To compute precision and recall, each saliency map is converted to a binary map with a threshold varying from 0 to 1. Comparing the binary map with the ground truth, precision is the ratio of the number of true positives (the intersection of predicted foreground and true foreground) to the number of predicted foreground pixels, and recall is the ratio of the number of true positives to the number of true foreground pixels. A better model has a larger area under the P/R curve.
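The precision/recall computation described above can be sketched as follows. The min-max normalization of the saliency map to [0, 1], the number of thresholds and the convention of reporting a precision of 1 when nothing is predicted as foreground are assumptions of this example, and `precision_recall_curve` is a hypothetical name.

```python
import numpy as np

def precision_recall_curve(saliency, ground_truth, num_thresholds=256):
    """Return a list of (threshold, precision, recall) tuples."""
    s = np.asarray(saliency, dtype=float)
    gt = np.asarray(ground_truth, dtype=bool)
    s = (s - s.min()) / max(s.max() - s.min(), 1e-12)  # scale to [0, 1]
    curve = []
    for t in np.linspace(0.0, 1.0, num_thresholds):
        predicted = s >= t
        true_positive = np.logical_and(predicted, gt).sum()
        n_pred, n_true = predicted.sum(), gt.sum()
        precision = true_positive / n_pred if n_pred else 1.0
        recall = true_positive / n_true if n_true else 0.0
        curve.append((float(t), float(precision), float(recall)))
    return curve
```

At threshold 0 every pixel is predicted as foreground, so recall is 1 and precision equals the fraction of foreground pixels in the ground truth; raising the threshold trades recall for precision.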

For all of these models, the final saliency maps are blurred to get optimal effects. We use Gaussian filters with different sigma parameters to blur the saliency maps and pick out the optimal one for each model. The blurring factor is chosen from 0.01 × width to 0.1 × width with 0.01 as interval.

Saliency prediction for natural images

The fixation datasets include Bruce’s (Bruce and Tsotsos 2005), Kootstra’s (Kootstra et al. 2008) and Judd’s (Judd et al. 2009). They contain 120, 100 and 1003 natural images, respectively. The segmentation datasets consist of Achanta’s (Achanta et al. 2009), Li’s (Li et al. 2013) and Zou’s (Zou et al. 2013). They have 1000, 235 and 1500 images, respectively. Zou’s dataset is derived from the PASCAL VOC 2012 segmentation challenge.

Some results for natural images are shown in Fig. 7. The quantitative comparisons for fixation datasets and segmentation datasets are shown in Table 1 and Fig. 8 respectively.

Fig. 7 Some saliency maps of different models

Table 1 sAUC comparison on fixation datasets

Fig. 8 Quantitative comparison on segmentation datasets. a Li’s dataset (235). b Achanta’s dataset (1000). c Zou’s dataset (1500)

Table 1 shows the quantitative comparison between models in terms of sAUC. The proposed model is effective on all of these fixation datasets. PQFT also performs relatively well on datasets of this type.

The first column of Fig. 7 consists of five original images of different sizes. Their resolutions are 681 × 511, 400 × 300, 763 × 512, 400 × 300 and 333 × 500 from the first to the fifth row, respectively. The sizes of the objects in these images also differ. The images in the first and third rows contain small objects (a man standing by a tree, two people near a snow mountain), while those in the second and fourth rows contain large objects. The image in the last row contains multiple objects. The models other than NVT, FTS and the proposed one resize the input images (e.g. HFT resizes to 256 × 256 and PQFT to 64 × 64) for better performance. Although NVT and FTS do not resize the input images, they are not effective for small and large targets simultaneously: NVT is effective only for small ones and FTS only for large ones. It is worth noting that the proposed model does not depend on the original size of the input image (no resizing is needed for subsequent processing) and is able to pop out objects of diverse sizes. The original height-width ratio is maintained throughout the calculation.

Figure 7 illustrates that our model is effective on small salient regions, large ones and images with multiple targets, while PQFT and NVT only highlight small objects or edges. FTS always fails when the salient objects are relatively small. The results of HFT are not very satisfying. In addition, Fig. 8 indicates that the proposed model is also able to deal with large salient objects. It is worth noting that FTS performs well only on its own dataset. Zou’s dataset (Zou et al. 2013) contains a large number of images with multiple segments or objects of different scales, and the results on it (Fig. 8c) show that our model is more robust than the others.

Table 2 lists the average time consumption of each model over the 120 images of Bruce’s database, where the images are uniformly of size 681 × 511. All code is written in Matlab and run on a Windows 7 computer with an Intel i7-2600 CPU.

Table 2 Average computational time per image

For unbiased comparison, the input images are resized to 256 × 256 for all models. PQFT and FTS are the fastest, as their processes are very simple. HFT is relatively slow because it employs 8 scale spaces to analyze the frequency domain. The time consumption of our model consists of decomposition, whitening and map selection. The NVT model is the most computationally expensive, as it produces many feature maps and uses iterative normalization.

To show the importance of band selection with both 2D entropy and maximum response, we conducted experiments with different strategies. Four cases are compared: the proposed model, bands simply combined without selection, bands selected using only 2D entropy, and bands selected using only maximum response. The experiments are conducted on Bruce’s dataset (Bruce and Tsotsos 2005), and the comparison is shown in Table 3.

Table 3 Comparison of different strategies

The comparison in Table 3 indicates that using both the 2D entropy and the maximum response of the maps is the optimal strategy for generating the saliency map. Combining all the frequency bands gives the least satisfying result, since much unnecessary information is included.
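For illustration only, the two statistics involved in band selection could be computed along the following lines. The sketch assumes one common definition of 2D entropy, namely the entropy of the joint histogram of each pixel's quantized value and the quantized mean of its neighborhood; the number of quantization levels, the neighborhood radius, the use of the largest absolute value as the "maximum response", and the function names are all assumptions. The rule that combines the two statistics into a selection decision is not reproduced here.

```python
import numpy as np

def entropy_2d(feature_map, levels=16, radius=1):
    """Entropy of the joint (pixel value, neighborhood mean) histogram, in bits."""
    m = np.asarray(feature_map, dtype=np.float64)
    span = m.max() - m.min()
    m = (m - m.min()) / span if span > 0 else np.zeros_like(m)

    # Neighborhood mean over a (2*radius+1)^2 window with replicated borders.
    h, w = m.shape
    size = 2 * radius + 1
    padded = np.pad(m, radius, mode="edge")
    mean = np.zeros_like(m)
    for dy in range(size):
        for dx in range(size):
            mean += padded[dy:dy + h, dx:dx + w]
    mean /= size * size

    # Quantize both quantities and accumulate the joint histogram.
    q_val = np.minimum((m * levels).astype(int), levels - 1)
    q_mean = np.minimum((mean * levels).astype(int), levels - 1)
    hist = np.zeros((levels, levels))
    np.add.at(hist, (q_val, q_mean), 1)

    p = hist[hist > 0] / m.size
    return float(-(p * np.log2(p)).sum())

def band_statistics(feature_maps):
    """Return (2D entropy, maximum absolute response) for each candidate map."""
    return [(entropy_2d(fm), float(np.abs(fm).max())) for fm in feature_maps]
```

A map dominated by one compact response tends to have a concentrated joint histogram and hence a low 2D entropy, whereas a cluttered map spreads over many histogram bins.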

Saliency prediction for psychological patterns

Psychological patterns of different types form another test bench, which is an important criterion for measuring the performance of attention models. Figure 9 shows that the proposed model can deal with all of the psychological patterns tested. All models except ours fail cases 1 and 3. The reason is that the whitening process makes the unique color component salient, so our model is able to predict the saliency of these patterns. These patterns cogently support the biological plausibility of the proposed model.

Fig. 9 Saliency maps of psychological images

PQFT fails in some cases, especially when the distinct pattern is relatively large. HFT, like our model, shows good potential on these patterns, except that it fails on the first and third rows. FTS focuses on image segmentation and, as expected, fails on most of these patterns. NVT is also less effective for these patterns.

Discussions

The proposed model is built on the basis of nCRF and, by considering a range of frequency bands, takes into account the low-frequency information that is mostly ignored by existing models. This is, to some extent, consistent with the human visual system. The experimental results also support the feasibility of the proposed model.

A few parameters are involved in the proposed model. Modifying the parameters of the Gabor filter does not make much difference, and the number of decomposition scales is relatively fixed at 3–5. Moreover, the whitening process is almost parameter free.

With regard to other models, PQFT totally discards the amplitude information (by flattening the amplitude spectrum) and uses only the phase information to construct the saliency map, which leads to only edges being popped out. Moreover, top-down instructions are difficult to incorporate into this model because it employs quaternions and the Fourier transform.
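The effect of flattening the amplitude spectrum can be illustrated with a simplified single-channel analogue. The sketch below is not PQFT itself: it uses an ordinary 2D Fourier transform on one grayscale channel rather than a quaternion transform over several channels, and it omits the smoothing that is normally applied to the result. The function name is hypothetical.

```python
import numpy as np

def phase_only_saliency(gray):
    """Reconstruct an image from its Fourier phase alone (amplitude set to 1)."""
    spectrum = np.fft.fft2(np.asarray(gray, dtype=np.float64))
    amplitude = np.abs(spectrum)
    # Flatten the amplitude spectrum: divide each coefficient by its magnitude,
    # leaving zero coefficients at zero to avoid division by zero.
    unit = np.divide(spectrum, amplitude,
                     out=np.zeros_like(spectrum), where=amplitude > 0)
    reconstruction = np.fft.ifft2(unit)
    # Squared magnitude of the phase-only reconstruction.
    return np.abs(reconstruction) ** 2
```

Because every frequency receives the same weight after flattening, the low-frequency content that carries the interior of a large uniform region is strongly attenuated relative to the original image, which is consistent with the observation that mainly edges and small structures are highlighted.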

FTS is effective only on its own database (in which most images have large salient areas) and fails on the others. This indicates that retaining most of the frequency components is effective for large objects (low-frequency components are crucial for large objects and are retained in FTS). The key defect is that it extracts a fixed bandwidth of information for all images. Although this strategy is effective for large objects, small objects require specific bands, sometimes even of certain orientations. Retaining too much frequency information is unnecessary and may obstruct saliency prediction in some cases.

The NVT model suffers from the problem that its center-surround operations exclude much of the low-frequency information, which actually contributes a lot when the salient region is relatively large. However, it is a cognitive model based on biological plausibility, and top-down guidance is easily incorporated into this kind of model (Zhang et al. 2008).

HFT uses Gaussian filters of different scales to filter the log amplitude spectrum of the input image to calculate saliency; this lacks sufficient biological support, and its meaning in the spatial domain remains unclear.

Conclusions and future work

This paper proposed a saliency model from the perspective of the frequency domain, which selects certain bands although it is implemented in the spatial domain. The three main steps are: dividing the input image into different feature maps (frequency bands), whitening the feature maps to extract saliency information, and picking out the optimal maps containing significant saliency information according to the mechanism of the receptive field. Our approach turns out to be superior to the others on various kinds of stimuli, including psychological patterns and natural images with large or small salient areas. Beyond that, top-down guidance or prior knowledge can easily be included: as images are divided into many channels, scales and orientations, different weights can be assigned to the feature maps when specific tasks are involved.

However, our model still has a couple of drawbacks. For one thing, the algorithm requires somewhat more computation than spectral methods. For another, some points are not entirely consistent with specific biological mechanisms; for example, 2D entropy as a measure for selecting feature maps lacks biological support.

Future work will place more emphasis on how to better match biological mechanisms and how to simplify the calculation of decomposing and selecting bands, since these processes are complex in both space and time. Moreover, the top-down mechanism appears to be more important, as we are interested in target detection in remote sensing images with attention models. We will attempt to combine a top-down mechanism with this bottom-up model to deal with target detection tasks in the future.