1 Introduction

The human visual system (HVS) uses visual attention to extract information from the redundancy of the natural world [1]. Redundancy in natural scene images is generally ineffective for scene classification or recognition. According to Barlow's efficient-coding hypothesis [2], focused visual attention can be used to identify and remove irrelevant information from a cluttered natural environment. Bottom-up and top-down visual attention mechanisms are the most common. Bottom-up models are primarily driven by external stimuli: a variety of visual information (such as color, frequency, texture, orientation, and motion) is processed to extract image features [3]. In contrast, top-down approaches aim to achieve a specific goal and typically involve high-level information feedback that modulates lower-level vision functions [4]. Saliency map prediction has been successfully applied to both of these computational neural models, and the visual attention mechanism has been the subject of long-term research [5].

Many studies have attempted bottom-up or top-down computational modeling to predict saliency maps. In the following, we briefly review some saliency prediction models that have achieved remarkable performance. One of the earliest computational models, proposed by Itti et al. [6], was based on the bottom-up mechanisms of low-level vision systems. The model comprises linear filtering, center-surround differences, across-scale combinations, and linear combinations. Achanta [7] devised a segmentation-oriented model that generates saliency maps to identify standout objects with clearly defined boundaries. Bruce and Tsotsos [8] proposed a model based on Shannon's self-information to estimate the saliency map. Based on an investigation of the amplitude spectrum of natural images, Li [9] presented a novel bottom-up computational model for determining visual saliency. Hou and Zhang [10] proposed a spatial-temporal visual attention model based on feature rarity: using the incremental coding length (ICL) method, they computed each feature's entropy gain and then constructed a saliency map. Murray et al. [11] showed that a color appearance model could be used to produce saliency maps, as it involves parameter selection and a spatial pooling function. Zhang and Sclaroff [12] created the Boolean map-based saliency (BMS) model; following Gestalt theory on figure-ground segregation, BMS computes saliency maps by examining the topological structure of Boolean maps [13]. Hou et al. [14] established a simple image descriptor known as the image signature and developed a saliency algorithm based on it. Goferman [15] described a context-aware saliency approach that seeks to recognize image regions that represent the scene; instead of identifying fixation locations, this approach detects the dominant object. Hou and Zhang [16] demonstrated a straightforward approach for generating the corresponding saliency map in the spatial domain by analyzing the log-spectrum of an input image. Guo [17] proposed a fast method that uses the spectral residual of the amplitude spectrum to build the saliency map. Schauerte and Stiefelhagen [18] first proposed employing eigenaxes and eigenangles for spectral saliency models based on the Fourier transform. Murray [19] proposed a saliency model based on a low-level spatial-chromatic function of the HVS, which successfully predicted chromatic induction phenomena and generated a saliency map [20]. Seo and Milanfar [21] provided an innovative, unifying computational framework for detecting both static and spatial-temporal saliency. Wavelet transforms have also begun to be widely used to estimate computational visual saliency maps [11]. Compared to the Fourier transform, the wavelet transform has high resolution in both the frequency and time domains. It can decompose signals at different scales, also known as multi-resolution/multi-scale analysis or sub-band coding, which captures more low-level information from the original signal [22]. Moreover, the wavelet transform can account for primary visual cortex (V1) properties, producing multi-scale and multi-orientation features in response to stimuli. The final estimated saliency map can be obtained by summing all the processed wavelet coefficients through an inverse wavelet transform [23].
However, when the wavelet transform is used to create saliency maps, global contrast is lost while local information is preserved. Spratling [24] proposed a saliency prediction model based on predictive coding theory [25].

Fig. 1

Architecture of the proposed saliency prediction model. The left panel image was selected from the MIT1003 dataset. The flow chart shows the framework of the proposed model, containing the chromatic response in the retina and spatial feature processing in the visual cortex. The natural image is first adapted before being decomposed into white-black, red-green, and yellow-blue opponent neural channels. In the spatial component, a discrete wavelet transform is applied to each opponent color channel, and the wavelet energy map is then measured. In the last step of the proposed model, the CSF is applied to each opponent wavelet energy channel and combined with each opponent feature. i and \(\theta \) indicate the image and model parameters, respectively. The details of each component are described in the following section. The graph on the right shows the saliency map of the left panel image on the inflated visual cortex using the proposed model

In the last few years, several studies have attempted to estimate saliency maps with deep convolutional neural networks (CNNs), which have achieved impressive performance compared to conventional methods [26,27,28,29,30,31]. Cornia et al. [32] introduced a novel deep architecture for saliency prediction. Current state-of-the-art models for saliency prediction use fully convolutional networks, which utilize a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. DeepGazeII [33] is a model that forecasts where people will look in images; it employs features from the VGG-19 deep neural network, which was trained to recognize objects in images. Compared to conventional methods, deep learning implementations are also easier to transfer to real-life applications such as object detection, video understanding, and image compression. In this study, our main contributions are fourfold:

  1.

    A psychophysically oriented saliency prediction model is proposed, inspired by the multi-channel structure of the human visual system. The model comprises opponent color channels, a wavelet transform, a wavelet energy map, and a contrast sensitivity function. It is a bottom-up model and was tested on several benchmark datasets using standard evaluation metrics.

  2.

    The spatio-chromatic contrast sensitivity function was implemented in Python,Footnote 1 and is available at https://github.com/sinodanishspain/CSFpy.

  3.

    The proposed model achieved consistently stable and superior performance under different metrics on natural images, psychophysical synthetic images, and dynamic scenes. Beyond the accuracy of saliency prediction, we take more neuroscience concepts into account than statistical ones, which is another computational goal of this study.

  4.

    We show that Fourier- and spectral-inspired saliency prediction models outperform other state-of-the-art non-neural network models, and even deep neural network models, on psychophysical synthetic images. The proposed model can be successfully applied to explain the "pop-out" effects in the visual search and attention mechanisms of primate vision systems and can inspire the development of better deep learning models. Furthermore, we suggest that deep neural networks require distinct architectures and objectives in order to reliably predict saliency on psychophysical synthetic images.

The rest of this paper is organized as follows: Sect. 2 introduces the proposed saliency prediction model, including the opponent color space, wavelet decomposition, wavelet energy map estimation, and the CSF. Section 3 introduces the datasets and evaluation metrics. Section 4 presents the experimental results. The final section provides the discussion and conclusions.

Fig. 2

Opponent color processing. The first column represents the raw RGB color space, followed by the white–black (WB) channel, red–green (RG) channel, and yellow–blue (YB) channel, each with a gray colormap. The final three columns depict the WB, RG, and YB channels in artificial color, in order to better visualize the opponent color processing in the visual system

2 The proposed saliency prediction model

2.1 Saliency prediction model

In this paper, we propose a biologically inspired visual saliency prediction model based on the human low-level visual system. The extraction of information in the retina, LGN, and V1 is a critical component of the visual neural pathway. The color opponent channels, wavelet transform, wavelet energy map, and contrast sensitivity function are the main components of the proposed architecture. The color opponent channels simulate the response of retinal cells to different spectral wavelengths, and the wavelet transform captures the multi-scale and multi-orientation properties of V1. The CSF describes the sensitivity of the human visual system to different spatial frequencies. The details of each component are described in the following sections. Figure 1 depicts the architecture of the computational saliency prediction model.

2.2 Gain control with von Kries chromatic adaptation model

Gain control exists throughout the visual information processing pipeline in the retina and cortex. In other words, gain control influences both top-down and bottom-up visual information flows, as well as attention-related cognitive functioning [34]. Meanwhile, gain control strives to maintain a steady state in the brain and a self-regulating balance between the brain and the natural environment. In the von Kries model, we multiply each channel of the image by a gain value after normalizing its intensity [35,36,37]. This approach has two implications. First, the channels are treated as independent signals, which is why independent gains are used. Second, the gain is applied not in RGB space but in the tristimulus LMS space. Assuming that the LMS values coincide with our image's tristimulus values, the von Kries model can be written as:

$$\begin{aligned} L_{2}&=\frac{L_{1}}{L_{\max }} L_{\max 2},\nonumber \\ M_{2}&=\frac{M_{1}}{M_{\max }} M_{\max 2},\nonumber \\ S_{2}&=\frac{S_{1}}{S_{\max }} S_{\max 2}, \end{aligned}$$
(1)
$$\begin{aligned} \left[ \begin{array}{c} L_{post} \\ M_{post} \\ S_{post} \end{array}\right] =\left[ \begin{array}{ccc} \frac{1}{L_{\max }} & 0 & 0 \\ 0 & \frac{1}{M_{\max }} & 0 \\ 0 & 0 & \frac{1}{S_{\max }} \end{array}\right] \left[ \begin{array}{c} L_{1} \\ M_{1} \\ S_{1} \end{array}\right] , \end{aligned}$$
(2)
$$\begin{aligned} \left[ \begin{array}{c} L_{2} \\ M_{2} \\ S_{2} \end{array}\right] =\left[ \begin{array}{ccc} L_{\max 2} & 0 & 0 \\ 0 & M_{\max 2} & 0 \\ 0 & 0 & S_{\max 2} \end{array}\right] \left[ \begin{array}{c} L_{post} \\ M_{post} \\ S_{post} \end{array}\right] , \end{aligned}$$
(3)

where \(L_{1}\) corresponds to the original image's L values; \(L_{\max }\), \(M_{\max }\), and \(S_{\max }\) correspond to the maximum value of each channel in the LMS image; \(L_{\max 2}\), \(M_{\max 2}\), and \(S_{\max 2}\) are the gain values, set to 0.6 in the proposed model; and \(L_{2}\) is the corrected L channel after adaptation.
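For illustration, the following is a minimal NumPy sketch of this gain-control step, assuming the input image has already been converted to LMS space; the function name and array layout are illustrative, and only the fixed gain of 0.6 comes from the text.

```python
# Minimal sketch of von Kries gain control (Eqs. 1-3), assuming the input image
# is already expressed in LMS space as an (H, W, 3) float array; the gain of
# 0.6 follows the value stated in the text, everything else is illustrative.
import numpy as np

def von_kries_adaptation(lms, gain=0.6):
    """Normalize each LMS channel by its maximum and rescale by the gain."""
    eps = 1e-12                                   # guard against division by zero
    channel_max = lms.reshape(-1, 3).max(axis=0)  # L_max, M_max, S_max
    normalized = lms / (channel_max + eps)        # L_post, M_post, S_post
    return normalized * gain                      # L_2, M_2, S_2 with gain = 0.6
```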

Fig. 3

The modeling of V1 simple and complex cells in each opponent channel. The red rectangle indicates hypercolumns in the visual cortex. From left to right, the graph depicts WB opponent neurons with different orientations and scales. The zoomed-out top/bottom graphs for each hypercolumn, shown in artificial color for better visualization of the features, indicate RG/YB opponent neurons across different orientations and scales. The V1 complex cells can be obtained from the sum of squares of the wavelet transform features across scales and orientations of the simple cells

2.3 Color appearance model

The representation of color in the brain can improve object recognition and identification. Trichromatic theory [38] and color appearance models based on the functioning of these sensors encode color information and have been widely used in low-level image processing. Two functional types of chromatically sensitive or selective neurons have been identified, single-opponent and double-opponent cells, based on the responses of long (L), medium (M), and short (S) wavelength cones [39]. Most saliency prediction models use the CIELAB or YUV color spaces as opponent color spaces. In our case, we use another opponent color space [40, 41], and the color space transform from RGB to \(O_{1}O_{2}O_{3}\) can be expressed as:

$$\begin{aligned} \left[ \begin{array}{l} O_{1} \\ O_{2} \\ O_{3} \end{array}\right] =\left[ \begin{array}{ccc} 0.2814 & 0.6938 & 0.0638 \\ -0.0971 & 0.1458 & -0.0250 \\ -0.0930 & -0.2529 & 0.4665 \end{array}\right] \left[ \begin{array}{l} R \\ G \\ B \end{array}\right] . \end{aligned}$$
(4)

The test natural scene images (of sizes \(256\times 256\) and \(512\times 512\)) were selected from the Signal and Image Processing Institute, University of Southern California,Footnote 2 and the Kodak lossless true-color image databaseFootnote 3 (of sizes \(512\times 768\) and \(768\times 512\)). All natural color images were resized to the same size (8-bit, \(256\times 256\)) as test images. Each chromatic image was converted from RGB space to the \(O_{1}O_{2}O_{3}\) domain based on the above conversion matrix. As can be seen in Fig. 2, the chromatic information (white-black, red-green, and yellow-blue) was decomposed into separate channels.
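A small NumPy sketch of this color space conversion is given below. The matrix values are taken from Eq. (4); the function name and the assumption that the input is a float RGB array are illustrative.

```python
# Sketch of the RGB to O1O2O3 opponent color conversion using the matrix of
# Eq. (4); assumes an (H, W, 3) float RGB image, other details are illustrative.
import numpy as np

RGB_TO_OPPONENT = np.array([
    [ 0.2814,  0.6938,  0.0638],   # O1: white-black (achromatic)
    [-0.0971,  0.1458, -0.0250],   # O2: red-green
    [-0.0930, -0.2529,  0.4665],   # O3: yellow-blue
])

def rgb_to_opponent(rgb):
    """Apply the linear opponent transform to every pixel."""
    return np.einsum("ij,hwj->hwi", RGB_TO_OPPONENT, rgb)
```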

2.4 Wavelet energy map

2.4.1 Visual cortex receptive fields with wavelet filters

The primary visual cortex contains neurons that represent the structure of the retinal image in terms of a wavelet basis, and its simple and complex cells can be modeled with wavelet filters. In our case, we did not consider the in-depth details of the interaction mechanisms of neurons within each hypercolumn (e.g., Li's model [42]). The simulated V1 complex receptive fields sum the squares of the wavelet transform outputs over different scales and orientations (see Fig. 3). The V1 simple receptive fields in each opponent channel are mathematically defined as:

$$\begin{aligned} V_{iv} = s_{i}o_{v} \end{aligned}$$
(5)
$$\begin{aligned} V_{ih} = s_{i}o_{h} \end{aligned}$$
(6)
$$\begin{aligned} V_{id} = s_{i}o_{d}, \end{aligned}$$
(7)

where s indicates the receptive field scale; o refers to the orientation, that is, vertical (v), horizontal (h), or diagonal (d); and i indexes the neurons/features. The V1 complex cells can be formulated as:

$$\begin{aligned} V_\textrm{complex} = \sum _{1}^{i} (s_{i}o_{v})^2 + \sum _{1}^{i} (s_{i}o_{h})^2 + \sum _{1}^{i} (s_{i}o_{d})^2. \end{aligned}$$
(8)
Fig. 4

The different decomposition levels of the DWT (e.g., first, second, and third levels): “a” indicates the original image, “h” indicates the horizontal feature, “v” refers to the vertical feature, and “d” represents the diagonal feature. The bottom-left image is the original image, and the following images represent the first-, second-, and third-level decomposition features from the original image

2.4.2 Wavelet transform and wavelet energy map

Wavelet analysis can decompose an image into multi-scale and multi-orientation features, similar to the representation in the visual cortex. Compared to the Fourier transform (FT), a wavelet transform can represent spatial and frequency information simultaneously. Alfred Haar first proposed the wavelet transform approach, and it has been widely used in signal analysis [43]; for example, for image compression, image denoising, and classification. Wavelet transforms have already been applied to visual saliency map prediction and achieved good performance [44]. However, wavelet energy maps remain rarely used in visual saliency map prediction, although they can enhance local contrast information in the decomposition sub-bands. In our proposed model, we use a discrete wavelet transform (DWT), which can be written as:

$$\begin{aligned} r[n]= ((I * f)[n]) \downarrow 2 = \left( \sum _{k=-\infty }^{\infty } I[k] f[n-k]\right) \downarrow 2 \end{aligned}$$
(9)

where I indicates the input image, f represents a series of filter banks (low-pass and high-pass), and \(\downarrow 2\) indicates down-sampling by a factor of two; the decomposition is repeated until the next layer's signal cannot be decomposed any further (see Fig. 4). A series of sub-band images is produced after convolution with the DWT; the wavelet energy map can then be calculated from each sub-band feature (see Fig. 5).

The wavelet energy map can be expressed as:

$$\begin{aligned} \mathcal{W}\mathcal{E}(i, j)=\Vert I(i, j)\Vert ^{2}=\sum _{k=1}^{3ind+1}\left| I_{k}(i, j)\right| ^{2}, \end{aligned}$$
(10)

where ind indicates the number of decomposition levels (e.g., three), so that \(3\,ind+1\) is the total number of sub-bands, and \(\left| I_{k}(i, j)\right| ^{2}\) represents the energy of the k-th sub-band feature.
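To make the sum-of-squares computation in Eqs. (8) and (10) concrete, here is a minimal sketch using the third-party PyWavelets library with a Haar wavelet; the nearest-neighbour upsampling of each sub-band back to image resolution is an illustrative assumption, not necessarily the implementation used in the paper.

```python
# Sketch of a wavelet energy map (Eqs. 8 and 10), assuming PyWavelets ('haar')
# and simple nearest-neighbour upsampling; parameter names are illustrative.
import numpy as np
import pywt

def wavelet_energy_map(channel, wavelet="haar", levels=3):
    """Sum of squared detail coefficients across scales and orientations."""
    coeffs = pywt.wavedec2(channel, wavelet, level=levels)
    energy = np.zeros_like(channel, dtype=float)
    # coeffs[0] is the approximation; coeffs[1:] are (cH, cV, cD) per level,
    # ordered from the coarsest to the finest scale.
    for k, details in enumerate(coeffs[1:]):
        scale = 2 ** (levels - k)          # upsampling factor back to full size
        for band in details:               # horizontal, vertical, diagonal
            up = np.kron(band ** 2, np.ones((scale, scale)))
            energy += up[: channel.shape[0], : channel.shape[1]]
    return energy
```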

Fig. 5

Each channel’s DWT map and the wavelet energy maps corresponding to it. The first column shows the DWT maps for achromatic (WB) and chromatic (RG, YB) channels. The second column is the wavelet energy map, obtained by summing across scales and orientation features for WB, RG, and YB opponent channels, respectively. The last column shows the sum of squares  energy maps in each opponent channel

2.5 Contrast sensitivity function

The human visual system is sensitive to contrast changes in natural environments. The function of the visual cortex can be decomposed into a set of components, one of the most significant being the CSF, which can be divided into achromatic and chromatic spatial CSFs [45]. In the proposed computational model, an achromatic CSF (aCSF) and chromatic CSFs (rgCSF and ybCSF) were implemented; these were first proposed by Mannos and Sakrison in 1974 [46] and further improved later [47, 48] (see Fig. 6). The achromatic CSF is defined as follows:

$$\begin{aligned} {\text {CSF}}(f_{x}, f_{y})&=Q(f) * L(f_{x}, f_{y}), \end{aligned}$$
(11)
$$\begin{aligned} Q(f)&=g *\left( \exp (-(f/f_{m}))-l* \exp \left( -f^{2} / s^{2}\right) \right) , \end{aligned}$$
(12)
$$\begin{aligned} L(f_{x}, f_{y})&=1-w *\left( 4(1-\exp (-(f/os))) * f_{x}^{2} * f_{y}^{2}\right) /f^{4}, \end{aligned}$$
(13)

where \((f_{x},f_{y})\) is a 2D spatial frequency vector (in cycles/degree), f is the modulus of the spatial frequency (cycles/degree), g is the overall gain (\(g=330.74\)), \(f_{m}\) is a parameter that controls the exponential decay of the CSF (\(f_{m}=7.28\)), l is the loss at low frequencies (\(l=0.837\)), s is a parameter that controls the attenuation of the loss factor at high frequencies (\(s=1.809\)), w is the weighting of the oblique effect (\(w=1\)), and os is the oblique effect scale (\(os=6.664\)). The CSFs were applied to the wavelet energy image in the Fourier domain, which can be described by the following formula:

$$\begin{aligned} CSF_{WE} = real({\mathcal {F}}({\mathcal {I}}(\textrm{I}(\textrm{F}(WE.real)) \odot CSF))), \end{aligned}$$
(14)
Fig. 6

Achromatic and chromatic CSFs. The images in the top row are 2D CSFs, and the bottom row shows 3D CSFs

where \(\textrm{F}\) indicates the 2D Fourier transform, \({\mathcal {F}}\) indicates the 2D inverse Fourier transform, \(\textrm{I}\) indicates fftshift, which rearranges a Fourier transform by shifting the zero-frequency component to the center of the image, and \({\mathcal {I}}\) indicates ifftshift, which rearranges a zero-frequency-shifted Fourier transform back to the original transform output. In other words, ifftshift undoes the result of fftshift. The Python implementations of the above CSFs (aCSF, rgCSF, and ybCSF) are available at https://github.com/sinodanishspain/CSFpy.
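The sketch below shows one way to build the achromatic CSF of Eqs. (11)-(13) and apply it in the Fourier domain as in Eq. (14), using NumPy only. The spatial frequency grid (the mapping of pixels to cycles per degree, which depends on viewing conditions) and the function names are assumptions; the reference implementation is the CSFpy repository cited above.

```python
# Sketch of the achromatic CSF (Eqs. 11-13) and its application to a wavelet
# energy map in the Fourier domain (Eq. 14); the frequency grid and the
# cycles-per-degree range are illustrative assumptions.
import numpy as np

def achromatic_csf(rows, cols, max_freq=32.0,
                   g=330.74, fm=7.28, l=0.837, s=1.809, w=1.0, os_=6.664):
    """2D achromatic contrast sensitivity function on an fftshift-centred grid."""
    fx = np.linspace(-max_freq, max_freq, cols)[None, :]
    fy = np.linspace(-max_freq, max_freq, rows)[:, None]
    f = np.sqrt(fx ** 2 + fy ** 2) + 1e-6          # avoid division by zero
    Q = g * (np.exp(-(f / fm)) - l * np.exp(-(f ** 2 / s ** 2)))
    L = 1 - w * (4 * (1 - np.exp(-(f / os_))) * fx ** 2 * fy ** 2) / f ** 4
    return Q * L

def apply_csf(wavelet_energy, csf):
    """Filter the (real) wavelet energy map by the CSF in the Fourier domain."""
    spectrum = np.fft.fftshift(np.fft.fft2(wavelet_energy.real))
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * csf)))
```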

3 Materials and methods

3.1 Datasets

The proposed model was tested on several well-known datasets, including MIT1003, MIT300, TORONTO, and SID4VAM. The following sections introduce the basic information of each dataset.

  • MIT1003 is an image dataset that includes 1003 images from the Flickr and LabelMe collections. The fixation map was generated by recording the eye-tracking data of 15 participants. It is the largest eye-tracking dataset. The dataset includes 779 landscape and 228 portrait images with sizes spanning from \(405\times 405\) to \(1024\times 1024\) pixels [49].

  • MIT300 is a benchmark saliency test dataset that includes 300 images obtained by recording a 39-observer eye-tracking dataset. The MIT300 dataset categories are highly varied and natural. The dataset can be used for model evaluation [49].

  • TORONTO includes 120 chromatic images free-viewed by 20 subjects. The dataset contains both outdoor and indoor scenes with a fixed resolution of \(511\times 681\)  pixels [8].

  • SID4VAM is a synthetic image database that is mainly used to psychophysically evaluate the V1 properties. This database is composed of 230 synthetic images, including 15 distinct types of low-level features (e.g., brightness, size, color, and orientation) with different target-distractor pop-out-type synthetic images [50].

3.2 Evaluation metrics

As mentioned before, there are several metrics for evaluating the agreement between human visual saliency and model predictions. In general, saliency evaluation can be divided into two branches: location-based and distribution-based. The former mainly focuses on discrete fixation locations in the saliency map, whereas the latter treats both the predicted saliency map and the human eye fixation map as continuous distributions [51]. In this research, we used AUC, NSS, CC, SIM, IG, and KL to evaluate the methods; the details of each evaluation metric are described in the following.Footnote 4

3.2.1 Area under the ROC curve (AUC)

The AUC metric is a popular approach for evaluating saliency model performance. The saliency map is treated as a binary classifier that separates positive samples from negative samples at different thresholds. The true positive (TP) rate is the proportion of saliency map values above a given threshold at fixation locations, whereas the false positive (FP) rate is the proportion of saliency map values above the threshold at non-fixation locations. In our case, the thresholds were taken from the saliency map values, and we report the AUC-Judd, AUC-Borji, and sAUC variants [52].
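The sketch below illustrates the AUC-Judd variant described above, thresholding the saliency map at the values observed at fixated pixels; it approximates, rather than reproduces, the benchmark implementation, and all names are illustrative.

```python
# Rough sketch of AUC-Judd: thresholds are taken from the saliency values at
# fixated pixels; TP/FP rates are accumulated and integrated with the
# trapezoidal rule. This approximates, not reproduces, the benchmark code.
import numpy as np

def auc_judd(saliency, fixations):
    """saliency: 2D float map; fixations: 2D binary map of fixated pixels."""
    s = saliency.ravel().astype(float)
    f = fixations.ravel() > 0
    s_fix = np.sort(s[f])[::-1]                    # thresholds, high to low
    n_fix, n_pix = s_fix.size, s.size
    tp, fp = [0.0], [0.0]
    for thresh in s_fix:
        above = s >= thresh
        tp.append((above & f).sum() / n_fix)       # fraction of fixations detected
        fp.append((above & ~f).sum() / (n_pix - n_fix))
    tp.append(1.0)
    fp.append(1.0)
    return np.trapz(tp, fp)
```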

3.2.2 Normalized scanpath saliency (NSS)

The NSS metric measures the correspondence between human eye fixation maps and model-predicted saliency maps [53]. Given a binary fixation map F and a saliency map S, the NSS is formally defined as:

$$\begin{aligned} NSS&=\frac{1}{N} \sum _{i=1}^{N} {\bar{S}}(i) \times F(i), \end{aligned}$$
(15)
$$\begin{aligned} N&=\sum _{i} F(i) \quad \text{ and } \quad {\bar{S}}=\frac{S-\mu (S)}{\sigma (S)}, \end{aligned}$$
(16)

where N is the total number of human eye fixations, \(\mu (S)\) is the mean value of the saliency map, and \(\sigma (S)\) is its standard deviation.
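A direct NumPy sketch of Eqs. (15)-(16) follows: z-score the saliency map and average it at the fixated locations. Variable names are illustrative.

```python
# Sketch of Eqs. (15)-(16): z-score the saliency map and average it at the
# fixated pixels; variable names are illustrative.
import numpy as np

def nss(saliency, fixations):
    """saliency: 2D float map; fixations: 2D binary fixation map."""
    s_norm = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return s_norm[fixations > 0].mean()
```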

3.2.3 Similarity metric (SIM)

The similarity metric (SIM) measures the similarity between two distributions and has been widely used in image quality assessment and image processing [54]. It measures the intersection between the normalized probability distributions of the human eye fixation map and the model-predicted saliency map. The SIM can be mathematically described as:

$$\begin{aligned} SIM=\sum _{i} \min (P(i), Q(i)), \end{aligned}$$
(17)

where P(i) and Q(i) are the normalized saliency map and the normalized fixation map, respectively. The similarity score lies in the range from zero to one.
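A minimal sketch of Eq. (17) is shown below, normalizing both maps to sum to one before taking the histogram intersection; names are illustrative.

```python
# Sketch of Eq. (17): histogram intersection of the two maps after
# normalizing each to sum to one; names are illustrative.
import numpy as np

def similarity(pred, fixation_density):
    p = pred / (pred.sum() + 1e-12)
    q = fixation_density / (fixation_density.sum() + 1e-12)
    return np.minimum(p, q).sum()
```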

3.2.4 Information gain (IG)

Information gain measures saliency map prediction accuracy from an information-theoretic point of view: it quantifies the information that the predicted saliency map provides, at the fixated locations, beyond a baseline map [55]. The mathematical formula for IG can be expressed as:

$$\begin{aligned} IG\left( P, Q^{B}\right) =\frac{1}{N} \sum _{i} Q_{i}^{B}\left[ \log _{2}\left( \epsilon +P_{i}\right) -\log _{2}\left( \epsilon +B_{i}\right) \right] , \end{aligned}$$
(18)

where P indicates the predicted saliency map, \(Q^{B}\) is the binary fixation map, B is the baseline map, and \(\epsilon \) represents a regularization parameter.
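A sketch of Eq. (18) is given below; treating the prediction and baseline as probability distributions before evaluating them at fixated pixels is an assumption of this sketch, and names and the epsilon value are illustrative.

```python
# Sketch of Eq. (18): information gain of the prediction over a baseline,
# evaluated at the fixated pixels; eps and names are illustrative.
import numpy as np

def information_gain(pred, baseline, fixations, eps=1e-12):
    p = pred / (pred.sum() + eps)                  # treat maps as distributions
    b = baseline / (baseline.sum() + eps)
    fix = fixations > 0
    return np.mean(np.log2(eps + p[fix]) - np.log2(eps + b[fix]))
```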

3.2.5 Pearson’s correlation coefficient (CC)

Pearson's correlation coefficient (CC) is a linear metric that measures the degree of similarity between the predicted saliency map and the ground-truth saliency map [56].

$$\begin{aligned} C C\left( P, Q^{D}\right) =\frac{\sigma \left( P, Q^{D}\right) }{\sigma (P) \times \sigma \left( Q^{D}\right) }, \end{aligned}$$
(19)

where P indicates the predicted saliency map and \(Q^{D}\) is the ground-truth saliency map.
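For completeness, Eq. (19) reduces to the standard Pearson correlation of the flattened maps, as in the sketch below; names are illustrative.

```python
# Sketch of Eq. (19): Pearson correlation between the two maps, computed from
# their flattened values; names are illustrative.
import numpy as np

def pearson_cc(pred, ground_truth):
    p = pred.ravel().astype(float)
    q = ground_truth.ravel().astype(float)
    return np.corrcoef(p, q)[0, 1]
```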

3.2.6 Kullback–Leibler divergence (KL)

The Kullback–Leibler divergence (KL) measures the distance between two distributions from an information-theoretic perspective [55]. It can be formally defined as:

$$\begin{aligned} KL\left( P, Q^{D}\right) =\sum _{i} Q_{i}^{D} \log \left( \epsilon +\frac{Q_{i}^{D}}{\epsilon +P_{i}}\right) , \end{aligned}$$
(20)

where P indicates the predicted saliency map, \(Q^{D}\) is the ground-truth saliency map, and \(\epsilon \) represents a regularity parameter.
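A short sketch of Eq. (20) follows, normalizing both maps to distributions first; the epsilon value and names are illustrative.

```python
# Sketch of Eq. (20): KL divergence between the ground-truth and predicted
# saliency distributions; eps and names are illustrative.
import numpy as np

def kl_divergence(pred, ground_truth, eps=1e-12):
    p = pred / (pred.sum() + eps)
    q = ground_truth / (ground_truth.sum() + eps)
    return np.sum(q * np.log(eps + q / (eps + p)))
```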

3.2.7 Other metrics

We also evaluated the performance of different saliency prediction models with two further metrics: precision-recall (PR) curves and the F-measure.Footnote 5 By binarizing the predicted saliency map at thresholds in [0, 255], a series of precision and recall pairs was calculated for each image in a dataset. The PR curve was then plotted from the average precision and recall over the dataset at the different thresholds [57].
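The sketch below shows the thresholding procedure described above. The exact F-measure definition is not given here (see Footnote 5); the weighting \(\beta^{2}=0.3\), commonly used in salient object detection, is an assumption of this sketch, as are the names.

```python
# Sketch of the PR-curve computation described above: binarize the prediction
# at thresholds in [0, 255] and accumulate precision/recall; the F-measure
# with beta^2 = 0.3 is a commonly used choice and is an assumption here.
import numpy as np

def pr_curve(pred_u8, gt_mask, beta2=0.3):
    """pred_u8: uint8 saliency map; gt_mask: binary ground-truth mask."""
    gt = gt_mask > 0
    precisions, recalls, f_scores = [], [], []
    for t in range(256):
        binary = pred_u8 >= t
        tp = (binary & gt).sum()
        precision = tp / (binary.sum() + 1e-12)
        recall = tp / (gt.sum() + 1e-12)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append((1 + beta2) * precision * recall /
                        (beta2 * precision + recall + 1e-12))
    return np.array(precisions), np.array(recalls), np.array(f_scores)
```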

Table 1 Quantitative scores of several models on the MIT1003 dataset. The baseline ITTI model is shown in light blue and the proposed model in green
Table 2 Quantitative scores of several models on the TORONTO dataset. The baseline ITTI model is shown in light blue and the proposed model in green
Table 3 Quantitative scores of several models on the SID4VAM dataset. The baseline ITTI model is shown in light blue and the proposed model in green

4 Experimental results

4.1 Quantitative comparison of the proposed model with other state-of-the-art models

To evaluate the performance of the proposed model, we compared it with eight other state-of-the-art models. For the quantitative comparison, we selected the MIT1003, TORONTO, and SID4VAM benchmarks. These results are reported in Tables 1, 2, and 3. Superior saliency prediction performance was achieved by models based on biological/cognitive and Fourier/spectral foundations. Our model achieved stable and superior performance across the different evaluation metrics compared to other biological/cognitive- and Fourier/spectral-inspired models. However, saliency prediction models based on convolutional neural networks outperformed other models on natural scene images, as large numbers of images were used to train these networks. Consequently, they cannot be directly compared to the other models, as they rest more on statistical than on neuroscientific principles. In this paper, we emphasize understanding saliency prediction from a neuroscience perspective, in order to further our understanding of the mechanism of visual attention. Furthermore, biological/cognitive- and Fourier/spectral-inspired saliency detection models outperformed the deep learning approaches (ML_Net and DeepGazeII) on the SID4VAM dataset (see Table 3 and Fig. 10). As previously stated, SID4VAM is a synthetic image database that is primarily used to psychophysically test V1 properties, which is also why we argue that deep learning models rely more on statistics than on neuroscience when explaining human visual attention mechanisms.

4.2 Qualitative comparison of the proposed model with other state-of-the-art models

We qualitatively tested the proposed model using the MIT1003, MIT300, TORONTO, SID4VAM, and UCF Sports datasets.Footnote 6 We also compared the model's performance with that of other state-of-the-art saliency prediction models on the MIT1003, TORONTO, and SID4VAM datasets. Figures 7, 8, 9, 10, and 12 show the saliency maps produced by the proposed model and other state-of-the-art models for sample images from the studied datasets. The performance of each saliency prediction model was evaluated through AUC and PR curves, as shown in Fig. 11. The proposed model predicted most of the salient objects in the given images. Furthermore, it successfully detected orientation, boundary, and pop-out effects when applied to the SID4VAM dataset. In summary, our biologically and Fourier/spectral-inspired saliency prediction model achieved superior and stable performance on natural images, psychophysical synthetic images, and dynamic scenes, compared with other existing models.

Fig. 7

Performance evaluation on the MIT1003 dataset. The first row shows color images, the second row shows ground-truth saliency maps, and the last row shows the proposed model's predicted saliency maps

Fig. 8

Left: Performance evaluation on the MIT300 dataset. The first and third columns are color images. The second and fourth columns are the proposed model’s predicted saliency maps. Right: Performance evaluation on the TORONTO dataset. The first and third columns are color images. The second and fourth columns are the proposed model’s predicted saliency maps

Fig. 9

Qualitative saliency prediction results on the MIT1003 dataset with different models. The first row shows six stimuli images selected from the MIT1003 dataset. The rows below show the predicted saliency maps obtained with Achanta, AIM, HFT, ICL, ITTI, SIM, and the proposed model, as well as the ground-truth (GT) saliency, with artificial color for better visualization

Fig. 10

Qualitative saliency prediction results on the SID4VAM dataset with different models. The first row shows six stimuli images selected from the SID4VAM dataset. The rows beneath show the saliency prediction results obtained with Achanta, AIM, HFT, ICL, ITTI, SIM, and the proposed model, as well as the ground-truth (GT) saliency, with artificial color for better visualization. The proposed model can be successfully applied to explain the "pop-out" effects in visual search

Fig. 11

ROC curve (AUC) and PR curves. Comparison of the area under the ROC curve (AUC) and PR curves, with different thresholds between our method and other state-of-the-art methods on three benchmark datasets

Fig. 12

Dynamic saliency prediction. For these sample frames from the UCF Sports Action dataset, the model clearly produced better results and perfectly captured the text information on the bottom-left

4.3 Ablation study

In this ablation study, we explore the efficacy of the main components, measuring model performance only with the AUC-Judd metric on the MIT1003 dataset. We first evaluate the contribution of the color appearance model, wavelet transform, wavelet energy map, and contrast sensitivity functions used for image feature extraction. The full method performs stably and consistently, whereas removing any module, for example, the opponent color channels, wavelet transform, wavelet energy map, or contrast sensitivity functions, degrades performance (see Table 4).

5 Discussion

Saliency modeling has become a well-known area of study in computer vision and neuroscience. Academics have employed many different architectures, yet these have not significantly increased model performance. Instead, we should investigate how humans perceive scenes; what draws their attention is crucial. In this study, we addressed several ways of going beyond the capabilities of such models.

First, it is vital to comprehend the cognitive attention mechanism. Visual attention is a selective cognitive process that helps us deal with the redundancy of natural scenes by focusing on key information while disregarding unnecessary information. Spatial attention is important in discrimination and appearance tasks in saliency prediction studies. Investigating the neural underpinnings of visual attention can help us design better and more accurate saliency prediction models; conversely, better saliency prediction models can help us comprehend the cognitive process of visual attention in the brain. Second, we require multi-modal and multi-label datasets to assess the effectiveness of saliency prediction models. The MIT saliency benchmark datasets were used to evaluate the proposed model. As previously stated, a saliency prediction model should perform well on both natural scene images and psychophysical synthetic images (e.g., SID4VAM); this can aid in improving model architectures and our understanding of the cognitive attention process. Third, we presented extensive experimental findings demonstrating that our method consistently achieved stable and better results compared to other state-of-the-art methods. It is worth noting that we used biologically inspired visual model estimation to determine saliency, and our proposed saliency prediction model incorporates more neuroscience notions than statistical concepts.

The proposed model incorporates opponent color channels, the wavelet transform, the wavelet energy map, and the contrast sensitivity function; however, it ignores the fact that, in the human visual system, natural images are preprocessed in multiple stages, from the retinal response to the LGN response, and modeling these stages could yield better saliency prediction accuracy and a closer approximation of the human visual system. In addition, other wavelet families [22] (e.g., Symlets, Morlet, Mexican hat, Meyer, and steerable pyramids) may be worth investigating in the future. Saliency prediction in dynamic videos may benefit from converting the spatio-chromatic contrast sensitivity functions into spatio-temporal ones, for instance, for the white-black, red-green, and yellow-blue channels. Most importantly, we must understand how to properly optimize model parameters, which is one of the most crucial steps in improving a model's ability to perform multiple tasks. The proposed model contains only a small number of parameters, which were essentially set and fixed for all tests. One of the main factors contributing to the method's efficiency is the use of wavelet energy maps and opponent CSFs as features. Furthermore, our extensive experimental results showed that the proposed saliency prediction measure, generated from a local image energy estimator, is far more effective and straightforward to implement than existing methods. Even though our method was built solely on biologically inspired computational principles, the resulting model showed significant agreement with the fixation behavior of the human visual system.

This study also has some limitations. First, the proposed model is inspired by the human low-level visual system, and each component of the model has already been used separately in various saliency models. The proposed architecture is based primarily on a simulated multi-channel coding principle, integrating and separating image features across components; additionally, in the final stage of the model, we applied spatio-chromatic CSFs to each channel, which is the first time CSFs have been applied to every opponent channel rather than only to the achromatic channel. Moreover, as stated before, traditional saliency prediction work includes both non-neural network and deep neural network-based models; all of these models focus only on natural image saliency prediction and completely ignore psychophysical synthetic images. As demonstrated in the results, deep neural networks outperform traditional (non-neural network) saliency prediction models on natural images, but they have limitations on psychophysical synthetic images, where, surprisingly, traditional (non-neural network) models outperform deep neural networks. Furthermore, deep neural networks become more unstable in the presence of distracting noise in natural images, while non-neural network saliency prediction remains more reliable [58]. Bowers et al. [59] identified further problems with using deep neural networks to model the human visual system; all of this suggests that deep neural networks must be tested on more broadly defined tasks to improve their generality.

Several extensions of this study can be examined in future work. First, noise or other degradations could be applied to natural or psychophysical synthetic images to fully examine how they affect saliency prediction results; this is important for improving the performance of saliency prediction models in complex real-world environments, for example, in fog or rain. Second, the saliency performance on both natural and psychophysical synthetic images could be checked with vision transformers, to see whether they behave similarly to deep convolutional networks. Third, training deep neural networks on psychophysical synthetic images would require a large synthetic image dataset; this would be fairer to deep networks, better reveal their performance on psychophysical synthetic images, and provide more evidence on the similarities and differences between deep networks and human vision.

Table 4 Ablation study of the proposed model: the role of each module in saliency prediction on the MIT1003 dataset

6 Conclusions

In this study, a computational psychophysical visual saliency prediction model inspired by the low-level human visual pathway was proposed. The model includes color opponent channels, a wavelet transform, a wavelet energy map, and contrast sensitivity functions to predict the saliency map. The model was evaluated on classical benchmark datasets and achieved consistently stable and superior visual saliency prediction performance compared with the baseline model. Furthermore, we found that models based on deep neural networks outperformed ours on natural image saliency prediction but underperformed on psychophysical synthetic images, whereas Fourier/spectral-inspired models showed the opposite pattern, as they simulate neural processing from the retina to V1. Deep neural networks rely on statistics more than on low-level visual system function, and we argue that they cannot reliably predict saliency on psychophysical synthetic images without specialized architectures and objectives. Lastly, we extended the model to spatial-temporal saliency prediction, and it was able to capture the most salient content in videos.