1 Introduction

The human visual system (HVS) uses visual attention to extract information from the redundancy of the natural world [1]. Redundancy in natural scene images is generally ineffective for scene classification or recognition. According to Barlow's efficient-coding hypothesis [2], focused visual attention can be used to identify and remove irrelevant information from a cluttered natural environment. Bottom-up and top-down visual attention mechanisms are the most common. Bottom-up models are primarily driven by external stimuli: a variety of visual information (such as color, frequency, texture, orientation, and motion) is processed to extract image features [3]. In contrast, top-down approaches aim to achieve a specific goal and typically involve high-level information feedback that modulates lower-level vision functions [4]. Saliency map prediction has been successfully applied to both of these computational neural models, and the visual attention mechanism has been the subject of long-term research [5].

Many studies have attempted bottom-up or top-down computational modeling to predict saliency maps. In the following, we briefly review some saliency prediction models that have achieved remarkable performance. One of the earliest computational models, proposed by Itti et al. [6], was based on the bottom-up mechanisms of low-level vision systems. The model comprises linear filtering, center-surround differences, across-scale combinations, and linear combinations. Achanta [7] devised a segmentation-oriented model that generates saliency maps to identify standout objects with clearly defined boundaries. Bruce and Tsotsos [8] proposed a model based on Shannon's self-information to estimate the saliency map. Based on an investigation of the amplitude spectrum of natural images, Li [9] presented a novel bottom-up computational model for determining visual saliency. Hou and Zhang [10] proposed a spatial-temporal visual attention model based on feature rarity: using the incremental coding length (ICL) method, they computed each feature's entropy gain and then constructed a saliency map. Murray et al. [11] showed that a color appearance model could be used to produce saliency maps, as it involves parameter selection and a spatial pooling function. Zhang and Sclaroff [12] created the Boolean map-based saliency (BMS) model; following Gestalt theory on figure-ground segregation, BMS computes saliency maps by examining the topological structure of Boolean maps [13]. Hou et al. [14] established a simple image descriptor known as the image signature and developed a saliency algorithm based on it. Goferman [15] described a context-aware saliency approach that seeks to recognize image regions that represent the scene; instead of identifying fixation locations, this approach detects the dominant object. Hou and Zhang [16] demonstrated a straightforward approach for generating the corresponding saliency map in the spatial domain by analyzing the log-spectrum of an input image. Guo [17] proposed a fast method that uses the spectral residual of the amplitude spectrum to build the saliency map. Schauerte and Stiefelhagen [18] first proposed employing eigenaxes and eigenangles for spectral saliency models based on the Fourier transform. Murray [19] proposed a saliency model based on a low-level spatial-chromatic function of the HVS, which successfully predicted chromatic induction phenomena and generated a saliency map [20]. Seo and Milanfar [21] provided an innovative, unifying computational framework for detecting both static and spatial-temporal saliency. Wavelet transforms have also begun to be widely used to estimate computational visual saliency maps [11]. Compared to the Fourier transform, the wavelet transform has high resolution in both the frequency and time domains. It can decompose signals at different scales, also known as multi-resolution/multi-scale analysis or sub-band coding, which captures more low-level information from the original signal [22]. Moreover, the wavelet transform can account for primary visual cortex (V1) properties, producing multi-scale and multi-orientation features in response to stimuli. The final estimated saliency map can be obtained by summing all the processed wavelet coefficients through an inverse wavelet transform [23].
However, when the wavelet transform is used to create saliency maps, global contrast is lost while local information is preserved. Spratling [24] proposed a saliency prediction model based on predictive coding theory [25].

Fig. 1

Architecture of the proposed saliency prediction model. The left panel image was selected from the MIT1003 dataset. The flow chart shows the framework of the proposed model, containing the chromatic response in the retina and spatial feature processing in the visual cortex. The natural image is first adapted before being decomposed into white-black, red-green, and yellow-blue opponent neural channels. In the spatial component, a discrete wavelet transform is applied to each opponent color channel, and the wavelet energy map is then measured. In the last step of the proposed model, the CSF is applied to each opponent wavelet energy channel and combined with each opponent feature. i and \(\theta \) indicate the image and model parameters, respectively. The details of each component are described in the following section. The graph on the right shows the saliency map of the left panel image on the inflated visual cortex using the proposed model

In the last few years, several studies have attempted to estimate saliency maps with deep convolutional neural networks (CNNs), which have achieved impressive performance compared to conventional methods [26,27,28,29,30,31]. Cornia et al. [32] introduced a novel deep architecture for saliency prediction. Current state-of-the-art models for saliency prediction use fully convolutional networks, which utilize a non-linear combination of features extracted from the last convolutional layer to predict saliency maps. DeepGazeII [33] is a model that forecasts where people will look in images; it employs features from the VGG-19 deep neural network, which was trained to recognize objects in images. Compared to conventional methods, deep learning implementations are also easier to transfer to real-life applications such as object detection, video understanding, and image compression. In this study, our main contributions are fourfold:

  1.

    A psychophysically oriented saliency prediction model is proposed, inspired by the multi-channel structure of the human visual system. The model comprises opponent color channels, a wavelet transform, a wavelet energy map, and a contrast sensitivity function. It is a bottom-up model and was tested on several benchmark datasets using standard evaluation metrics.

  2.

    The spatio-chromatic contrast sensitivity function was implemented in Python,Footnote 1 and is available at https://github.com/sinodanishspain/CSFpy.

  3.

    The proposed model achieved consistently stable and superior performance under different metrics on natural images, psychophysical synthetic images, and dynamic scenes. Beyond the accuracy of saliency prediction, we take more neuroscience concepts into account than statistical ones, which is another computational goal of this study.

  4.

    We show that Fourier- and spectral-inspired saliency prediction models outperform other state-of-the-art non-neural network models, and even deep neural network models, on psychophysical synthetic images. The proposed model can be successfully applied to explain the "pop-out" effects in the visual search and attention mechanisms of primate vision systems and can inspire the development of better deep learning models. Furthermore, we suggest that deep neural networks require distinct architectures and objectives in order to reliably predict saliency on psychophysical synthetic images.

The rest of this paper is organized as follows: Sect. 2 introduces the proposed saliency prediction model, including the opponent color space, wavelet decomposition, wavelet energy map estimation, and the CSF. Section 3 introduces the datasets and evaluation metrics. Section 4 presents the experimental results. The final section provides the discussion and conclusions.

Fig. 2

Opponent color processing. The first column represents the raw RGB color space, followed by the white–black (WB) channel, red–green (RG) channel, and yellow–blue (YB) channel, each with a gray colormap. The final three columns depict the WB, RG, and YB channels in artificial color, in order to better visualize the opponent color processing in the visual system

2 The proposed saliency prediction model

2.1 Saliency prediction model

In this paper, we propose a biologically inspired visual saliency prediction model based on the human low-level visual system. The extraction of information in the retina, LGN, and V1 is a critical component of the visual neural pathway. The color opponent channels, wavelet transform, wavelet energy map, and contrast sensitivity function are the main components of the proposed architecture. The color opponent channels simulate the response of retinal cells to different spectral wavelengths, and the wavelet transform captures the multi-scale and multi-orientation properties of V1. The CSF describes the sensitivity of the human visual system to different spatial frequencies. The details of each component are described in the following sections. Figure 1 depicts the architecture of the computational saliency prediction model.

2.2 Gain control with von Kries chromatic adaptation model

Gain control exists throughout the visual information processing pipeline in the retina and cortex. In other words, gain control influences both top-down and bottom-up visual information flows, as well as attention-related cognitive functioning [34]. Meanwhile, gain control strives to maintain a steady state in the brain and a self-regulating balance between the brain and the natural environment. In the von Kries model, we multiply each channel of the image by a gain value after normalizing its intensity [35,36,37]. This approach has two implications. First, the channels are treated as independent signals, which is why independent gains are used. Second, the gain is applied not in RGB space but in the tristimulus LMS space. Assuming that the LMS values coincide with our image's tristimulus values, the von Kries model can be written as:

$$\begin{aligned} L_{2}&=\frac{L_{1}}{L_{\max }} L_{\max 2},\nonumber \\ M_{2}&=\frac{M_{1}}{M_{\max }} M_{\max 2},\nonumber \\ S_{2}&=\frac{S_{1}}{S_{\max }} S_{\max 2}, \end{aligned}$$
(1)
$$\begin{aligned} \left[ \begin{array}{c} L_{post} \\ M_{post} \\ S_{post} \end{array}\right] =\left[ \begin{array}{ccc} \frac{1}{L_{\max }} & 0 & 0 \\ 0 & \frac{1}{M_{\max }} & 0 \\ 0 & 0 & \frac{1}{S_{\max }} \end{array}\right] \left[ \begin{array}{c} L_{1} \\ M_{1} \\ S_{1} \end{array}\right] , \end{aligned}$$
(2)
$$\begin{aligned} \left[ \begin{array}{c} L_{2} \\ M_{2} \\ S_{2} \end{array}\right] =\left[ \begin{array}{ccc} L_{\max 2} & 0 & 0 \\ 0 & M_{\max 2} & 0 \\ 0 & 0 & S_{\max 2} \end{array}\right] \left[ \begin{array}{c} L_{post} \\ M_{post} \\ S_{post} \end{array}\right] , \end{aligned}$$
(3)

where \(L_{1}\) corresponds to the original image's L values; \(L_{\max }\), \(M_{\max }\), and \(S_{\max }\) correspond to the maximum value of each channel in the LMS image; \(L_{\max 2}\), \(M_{\max 2}\), and \(S_{\max 2}\) are the gain values, set to 0.6 in the proposed model; and \(L_{2}\) is the corrected L channel after adaptation.
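For illustration, the following is a minimal NumPy sketch of this gain-control step, assuming the input image has already been converted to LMS space; the function name and array layout are illustrative, and only the fixed gain of 0.6 comes from the text.

```python
# Minimal sketch of von Kries gain control (Eqs. 1-3), assuming the input image
# is already expressed in LMS space as an (H, W, 3) float array; the gain of
# 0.6 follows the value stated in the text, everything else is illustrative.
import numpy as np

def von_kries_adaptation(lms, gain=0.6):
    """Normalize each LMS channel by its maximum and rescale by the gain."""
    eps = 1e-12                                   # guard against division by zero
    channel_max = lms.reshape(-1, 3).max(axis=0)  # L_max, M_max, S_max
    normalized = lms / (channel_max + eps)        # L_post, M_post, S_post
    return normalized * gain                      # L_2, M_2, S_2 with gain = 0.6
```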

Fig. 3

The modeling of V1 simple and complex cells in each opponent channel. The red rectangle indicates hypercolumns in the visual cortex. From left to right, the graph depicts WB opponent neurons with different orientations and scales. The zoomed-out top/bottom graphs for each hypercolumn, shown in artificial color for better visualization of the features, indicate RG/YB opponent neurons across different orientations and scales. The V1 complex cells can be obtained from the sum of squares of the wavelet transform features across scales and orientations of the simple cells

2.3 Color appearance model

The representation of color in the brain can improve object recognition and identification. Trichromatic theory [38] and color appearance models based on the functioning of these sensors encode color information and have been widely used in low-level image processing. Two functional types of chromatically sensitive or selective neurons have been identified, single-opponent and double-opponent cells, based on the responses of long (L), medium (M), and short (S) wavelength cones [39]. Most saliency prediction models use the CIELAB or YUV color spaces as opponent color spaces. In our case, we use another opponent color space [40, 41], and the color space transform from RGB to \(O_{1}O_{2}O_{3}\) can be expressed as:

$$\begin{aligned} \left[ \begin{array}{l} O_{1} \\ O_{2} \\ O_{3} \end{array}\right] =\left[ \begin{array}{ccc} 0.2814 & 0.6938 & 0.0638 \\ -0.0971 & 0.1458 & -0.0250 \\ -0.0930 & -0.2529 & 0.4665 \end{array}\right] \left[ \begin{array}{l} R \\ G \\ B \end{array}\right] . \end{aligned}$$
(4)

The test natural scene images (of sizes \(256\times 256\) and \(512\times 512\)) were selected from the Signal and Image Processing Institute, University of Southern California,Footnote 2 and the Kodak lossless true-color image databaseFootnote 3 (of sizes \(512\times 768\) and \(768\times 512\)). All natural color images were resized to the same size (8-bit, \(256\times 256\)) as test images. Each chromatic image was converted from RGB space to the \(O_{1}O_{2}O_{3}\) domain based on the above conversion matrix. As can be seen in Fig. 2, the chromatic information (white-black, red-green, and yellow-blue) was decomposed into separate channels.
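A small NumPy sketch of this color space conversion is given below. The matrix values are taken from Eq. (4); the function name and the assumption that the input is a float RGB array are illustrative.

```python
# Sketch of the RGB to O1O2O3 opponent color conversion using the matrix of
# Eq. (4); assumes an (H, W, 3) float RGB image, other details are illustrative.
import numpy as np

RGB_TO_OPPONENT = np.array([
    [ 0.2814,  0.6938,  0.0638],   # O1: white-black (achromatic)
    [-0.0971,  0.1458, -0.0250],   # O2: red-green
    [-0.0930, -0.2529,  0.4665],   # O3: yellow-blue
])

def rgb_to_opponent(rgb):
    """Apply the linear opponent transform to every pixel."""
    return np.einsum("ij,hwj->hwi", RGB_TO_OPPONENT, rgb)
```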

2.4 Wavelet energy map

2.4.1 Visual cortex receptive fields with wavelet filters

The primary visual cortex contains neurons that represent the structure of the retinal image in terms of a wavelet basis, and its simple and complex cells can be modeled with wavelet filters. In our case, we did not consider the in-depth details of the interaction mechanisms of neurons within each hypercolumn (e.g., Li's model [42]). The simulated V1 complex receptive fields sum the squares of the wavelet transform outputs over different scales and orientations (see Fig. 3). The V1 simple receptive fields in each opponent channel are mathematically defined as:

$$\begin{aligned} V_{iv} = s_{i}o_{v} \end{aligned}$$
(5)
$$\begin{aligned} V_{ih} = s_{i}o_{h} \end{aligned}$$
(6)
$$\begin{aligned} V_{id} = s_{i}o_{d}, \end{aligned}$$
(7)

where s indicates the receptive field scale; o refers to the orientation, that is, vertical (v), horizontal (h), or diagonal (d); and i indexes the neurons/features. The V1 complex cells can be formulated as:

$$\begin{aligned} V_\textrm{complex} = \sum _{1}^{i} (s_{i}o_{v})^2 + \sum _{1}^{i} (s_{i}o_{h})^2 + \sum _{1}^{i} (s_{i}o_{d})^2. \end{aligned}$$
(8)
Fig. 4

The different decomposition levels of the DWT (e.g., first, second, and third levels): “a” indicates the original image, “h” indicates the horizontal feature, “v” refers to the vertical feature, and “d” represents the diagonal feature. The bottom-left image is the original image, and the following images represent the first-, second-, and third-level decomposition features from the original image

2.4.2 Wavelet transform and wavelet energy map

Wavelet analysis can decompose an image into multi-scale and multi-orientation features, similar to the representation in the visual cortex. Compared to the Fourier transform (FT), a wavelet transform can represent spatial and frequency information simultaneously. Alfred Haar first proposed the wavelet transform approach, and it has been widely used in signal analysis [43]; for example, for image compression, image denoising, and classification. Wavelet transforms have already been applied to visual saliency map prediction and achieved good performance [44]. However, wavelet energy maps remain rarely used in visual saliency map prediction, although they can enhance local contrast information in the decomposition sub-bands. In our proposed model, we use a discrete wavelet transform (DWT), which can be written as:

$$\begin{aligned} r[n]= ((I * f)[n]) \downarrow 2 = \left( \sum _{k=-\infty }^{\infty } I[k] f[n-k]\right) \downarrow 2 \end{aligned}$$
(9)

where I indicates the input image, f represents a series of filter banks (low-pass and high-pass), and \(\downarrow 2\) indicates down-sampling by a factor of two; the decomposition is repeated until the next layer's signal cannot be decomposed any further (see Fig. 4). A series of sub-band images is produced after convolution with the DWT; the wavelet energy map can then be calculated from each sub-band feature (see Fig. 5).

The wavelet energy map can be expressed as:

$$\begin{aligned} \mathcal{W}\mathcal{E}(i, j)=\Vert I(i, j)\Vert ^{2}=\sum _{k=1}^{3ind+1}\left| I_{k}(i, j)\right| ^{2}, \end{aligned}$$
(10)

where ind indicates the number of decomposition levels (e.g., three), so that \(3\,ind+1\) is the total number of sub-bands, and \(\left| I_{k}(i, j)\right| ^{2}\) represents the energy of the k-th sub-band feature.
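To make the sum-of-squares computation in Eqs. (8) and (10) concrete, here is a minimal sketch using the third-party PyWavelets library with a Haar wavelet; the nearest-neighbour upsampling of each sub-band back to image resolution is an illustrative assumption, not necessarily the implementation used in the paper.

```python
# Sketch of a wavelet energy map (Eqs. 8 and 10), assuming PyWavelets ('haar')
# and simple nearest-neighbour upsampling; parameter names are illustrative.
import numpy as np
import pywt

def wavelet_energy_map(channel, wavelet="haar", levels=3):
    """Sum of squared detail coefficients across scales and orientations."""
    coeffs = pywt.wavedec2(channel, wavelet, level=levels)
    energy = np.zeros_like(channel, dtype=float)
    # coeffs[0] is the approximation; coeffs[1:] are (cH, cV, cD) per level,
    # ordered from the coarsest to the finest scale.
    for k, details in enumerate(coeffs[1:]):
        scale = 2 ** (levels - k)          # upsampling factor back to full size
        for band in details:               # horizontal, vertical, diagonal
            up = np.kron(band ** 2, np.ones((scale, scale)))
            energy += up[: channel.shape[0], : channel.shape[1]]
    return energy
```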

Fig. 5

Each channel’s DWT map and the wavelet energy maps corresponding to it. The first column shows the DWT maps for achromatic (WB) and chromatic (RG, YB) channels. The second column is the wavelet energy map, obtained by summing across scales and orientation features for WB, RG, and YB opponent channels, respectively. The last column shows the sum of squares  energy maps in each opponent channel

2.5 Contrast sensitivity function

The human visual system is sensitive to contrast changes in natural environments. The function of the visual cortex can be decomposed into a set of components, one of the most significant being the CSF, which can be divided into achromatic and chromatic spatial CSFs [45]. In the proposed computational model, an achromatic CSF (aCSF) and chromatic CSFs (rgCSF and ybCSF) were implemented; these were first proposed by Mannos and Sakrison in 1974 [46] and further improved later [47, 48] (see Fig. 6). The achromatic CSF is defined as follows:

$$\begin{aligned} {\text {CSF}}(f_{x}, f_{y})&=Q(f) * L(f_{x}, f_{y}), \end{aligned}$$
(11)
$$\begin{aligned} Q(f)&=g *\left( \exp (-(f/f_{m}))-l* \exp \left( -f^{2} / s^{2}\right) \right) , \end{aligned}$$
(12)
$$\begin{aligned} L(f_{x}, f_{y})&=1-w *\left( 4(1-\exp (-(f/os))) * f_{x}^{2} * f_{y}^{2}\right) /f^{4}, \end{aligned}$$
(13)

where \((f_{x},f_{y})\) is a 2D spatial frequency vector (in cycles/degree), f is the modulus of the spatial frequency (cycles/degree), g is the overall gain (\(g=330.74\)), \(f_{m}\) is a parameter that controls the exponential decay of the CSF (\(f_{m}=7.28\)), l is the loss at low frequencies (\(l=0.837\)), s is a parameter that controls the attenuation of the loss factor at high frequencies (\(s=1.809\)), w is the weighting of the oblique effect (\(w=1\)), and os is the oblique effect scale (\(os=6.664\)). The CSFs were applied to the wavelet energy image in the Fourier domain, which can be described by the following formula:

$$\begin{aligned} CSF_{WE} = real({\mathcal {F}}({\mathcal {I}}(\textrm{I}(\textrm{F}(WE.real)) \odot CSF))), \end{aligned}$$
(14)
Fig. 6

Achromatic and chromatic CSFs. The images in the top row are 2D CSFs, and the bottom row shows 3D CSFs

where \(\textrm{F}\) indicates the 2D Fourier transform, \({\mathcal {F}}\) indicates the 2D inverse Fourier transform, \(\textrm{I}\) indicates fftshift, which rearranges a Fourier transform by shifting the zero-frequency component to the center of the image, and \({\mathcal {I}}\) indicates ifftshift, which rearranges a zero-frequency-shifted Fourier transform back to the original transform output. In other words, ifftshift undoes the result of fftshift. The Python implementations of the above CSFs (aCSF, rgCSF, and ybCSF) are available at https://github.com/sinodanishspain/CSFpy.
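The sketch below shows one way to build the achromatic CSF of Eqs. (11)-(13) and apply it in the Fourier domain as in Eq. (14), using NumPy only. The spatial frequency grid (the mapping of pixels to cycles per degree, which depends on viewing conditions) and the function names are assumptions; the reference implementation is the CSFpy repository cited above.

```python
# Sketch of the achromatic CSF (Eqs. 11-13) and its application to a wavelet
# energy map in the Fourier domain (Eq. 14); the frequency grid and the
# cycles-per-degree range are illustrative assumptions.
import numpy as np

def achromatic_csf(rows, cols, max_freq=32.0,
                   g=330.74, fm=7.28, l=0.837, s=1.809, w=1.0, os_=6.664):
    """2D achromatic contrast sensitivity function on an fftshift-centred grid."""
    fx = np.linspace(-max_freq, max_freq, cols)[None, :]
    fy = np.linspace(-max_freq, max_freq, rows)[:, None]
    f = np.sqrt(fx ** 2 + fy ** 2) + 1e-6          # avoid division by zero
    Q = g * (np.exp(-(f / fm)) - l * np.exp(-(f ** 2 / s ** 2)))
    L = 1 - w * (4 * (1 - np.exp(-(f / os_))) * fx ** 2 * fy ** 2) / f ** 4
    return Q * L

def apply_csf(wavelet_energy, csf):
    """Filter the (real) wavelet energy map by the CSF in the Fourier domain."""
    spectrum = np.fft.fftshift(np.fft.fft2(wavelet_energy.real))
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * csf)))
```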

3 Materials and methods

3.1 Datasets

The proposed model was tested on several well-known datasets, including MIT1003, MIT300, TORONTO, and SID4VAM. The following sections introduce the basic information of each dataset.

  • MIT1003 is an image dataset that includes 1003 images from the Flickr and LabelMe collections. The fixation map was generated by recording the eye-tracking data of 15 participants. It is the largest eye-tracking dataset. The dataset includes 779 landscape and 228 portrait images with sizes spanning from \(405\times 405\) to \(1024\times 1024\) pixels [49].

  • MIT300 is a benchmark saliency test dataset that includes 300 images obtained by recording a 39-observer eye-tracking dataset. The MIT300 dataset categories are highly varied and natural. The dataset can be used for model evaluation [49].

  • TORONTO includes 120 chromatic images free-viewed by 20 subjects. The dataset contains both outdoor and indoor scenes with a fixed resolution of \(511\times 681\)  pixels [8].

  • SID4VAM is a synthetic image database that is mainly used to psychophysically evaluate the V1 properties. This database is composed of 230 synthetic images, including 15 distinct types of low-level features (e.g., brightness, size, color, and orientation) with different target-distractor pop-out-type synthetic images [50].

3.2 Evaluation metrics

As mentioned before, there are several metrics for evaluating the agreement between human visual saliency and model predictions. In general, saliency evaluation can be divided into two branches: location-based and distribution-based. The former mainly focuses on discrete fixation locations in the saliency map, whereas the latter treats both the predicted saliency map and the human eye fixation map as continuous distributions [51]. In this research, we used AUC, NSS, CC, SIM, IG, and KL to evaluate the methods; the details of each evaluation metric are described in the following.Footnote 4

3.2.1 Area under the ROC curve (AUC)

The AUC metric is a popular approach for evaluating saliency model performance. The saliency map is treated as a binary classifier that separates positive samples from negative samples at different thresholds. The true positive (TP) rate is the proportion of saliency map values above a given threshold at fixation locations, whereas the false positive (FP) rate is the proportion of saliency map values above the threshold at non-fixation locations. In our case, the thresholds were taken from the saliency map values, and we report the AUC-Judd, AUC-Borji, and sAUC variants [52].
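The sketch below illustrates the AUC-Judd variant described above, thresholding the saliency map at the values observed at fixated pixels; it approximates, rather than reproduces, the benchmark implementation, and all names are illustrative.

```python
# Rough sketch of AUC-Judd: thresholds are taken from the saliency values at
# fixated pixels; TP/FP rates are accumulated and integrated with the
# trapezoidal rule. This approximates, not reproduces, the benchmark code.
import numpy as np

def auc_judd(saliency, fixations):
    """saliency: 2D float map; fixations: 2D binary map of fixated pixels."""
    s = saliency.ravel().astype(float)
    f = fixations.ravel() > 0
    s_fix = np.sort(s[f])[::-1]                    # thresholds, high to low
    n_fix, n_pix = s_fix.size, s.size
    tp, fp = [0.0], [0.0]
    for thresh in s_fix:
        above = s >= thresh
        tp.append((above & f).sum() / n_fix)       # fraction of fixations detected
        fp.append((above & ~f).sum() / (n_pix - n_fix))
    tp.append(1.0)
    fp.append(1.0)
    return np.trapz(tp, fp)
```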

3.2.2 Normalized scanpath saliency (NSS)

The NSS metric measures the correspondence between human eye fixation maps and model-predicted saliency maps [53]. Given a binary fixation map F and a saliency map S, the NSS is formally defined as:

$$\begin{aligned} NSS&=\frac{1}{N} \sum _{i=1}^{N} {\bar{S}}(i) \times F(i), \end{aligned}$$
(15)
$$\begin{aligned} N&=\sum _{i} F(i) \quad \text{ and } \quad {\bar{S}}=\frac{S-\mu (S)}{\sigma (S)}, \end{aligned}$$
(16)

where N is the total number of human eye fixations, \(\mu (S)\) is the mean value of the saliency map, and \(\sigma (S)\) is its standard deviation.
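A direct NumPy sketch of Eqs. (15)-(16) follows: z-score the saliency map and average it at the fixated locations. Variable names are illustrative.

```python
# Sketch of Eqs. (15)-(16): z-score the saliency map and average it at the
# fixated pixels; variable names are illustrative.
import numpy as np

def nss(saliency, fixations):
    """saliency: 2D float map; fixations: 2D binary fixation map."""
    s_norm = (saliency - saliency.mean()) / (saliency.std() + 1e-12)
    return s_norm[fixations > 0].mean()
```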

3.2.3 Similarity metric (SIM)

The similarity metric (SIM) measures the similarity between two distributions and has been widely used in image quality assessment and image processing [54]. It measures the intersection between the normalized probability distributions of the human eye fixation map and the model-predicted saliency map. The SIM can be mathematically described as:

$$\begin{aligned} SIM=\sum _{i} \min (P(i), Q(i)), \end{aligned}$$
(17)

where P(i) and Q(i) are the normalized saliency map and the normalized fixation map, respectively. The similarity score lies in the range from zero to one.
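A minimal sketch of Eq. (17) is shown below, normalizing both maps to sum to one before taking the histogram intersection; names are illustrative.

```python
# Sketch of Eq. (17): histogram intersection of the two maps after
# normalizing each to sum to one; names are illustrative.
import numpy as np

def similarity(pred, fixation_density):
    p = pred / (pred.sum() + 1e-12)
    q = fixation_density / (fixation_density.sum() + 1e-12)
    return np.minimum(p, q).sum()
```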

3.2.4 Information gain (IG)

Information gain measures saliency map prediction accuracy from an information-theoretic point of view: it quantifies the information that the predicted saliency map provides, at the fixated locations, beyond a baseline map [55]. The mathematical formula for IG can be expressed as:

$$\begin{aligned} IG\left( P, Q^{B}\right) =\frac{1}{N} \sum _{i} Q_{i}^{B}\left[ \log _{2}\left( \epsilon +P_{i}\right) -\log _{2}\left( \epsilon +B_{i}\right) \right] , \end{aligned}$$
(18)

where P indicates the predicted saliency map, \(Q^{B}\) is the binary fixation map, B is the baseline map, and \(\epsilon \) represents a regularization parameter.
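A sketch of Eq. (18) is given below; treating the prediction and baseline as probability distributions before evaluating them at fixated pixels is an assumption of this sketch, and names and the epsilon value are illustrative.

```python
# Sketch of Eq. (18): information gain of the prediction over a baseline,
# evaluated at the fixated pixels; eps and names are illustrative.
import numpy as np

def information_gain(pred, baseline, fixations, eps=1e-12):
    p = pred / (pred.sum() + eps)                  # treat maps as distributions
    b = baseline / (baseline.sum() + eps)
    fix = fixations > 0
    return np.mean(np.log2(eps + p[fix]) - np.log2(eps + b[fix]))
```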

3.2.5 Pearson’s correlation coefficient (CC)

Pearson's correlation coefficient (CC) is a linear metric that measures the degree of similarity between the predicted saliency map and the ground-truth saliency map [56].

$$\begin{aligned} C C\left( P, Q^{D}\right) =\frac{\sigma \left( P, Q^{D}\right) }{\sigma (P) \times \sigma \left( Q^{D}\right) }, \end{aligned}$$
(19)

where P indicates the predicted saliency map and \(Q^{D}\) is the ground-truth saliency map.
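For completeness, Eq. (19) reduces to the standard Pearson correlation of the flattened maps, as in the sketch below; names are illustrative.

```python
# Sketch of Eq. (19): Pearson correlation between the two maps, computed from
# their flattened values; names are illustrative.
import numpy as np

def pearson_cc(pred, ground_truth):
    p = pred.ravel().astype(float)
    q = ground_truth.ravel().astype(float)
    return np.corrcoef(p, q)[0, 1]
```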

3.2.6 Kullback–Leibler divergence (KL)

The Kullback–Leibler divergence (KL) measures the distance between two distributions from an information-theoretic perspective [55]. It can be formally defined as:

$$\begin{aligned} KL\left( P, Q^{D}\right) =\sum _{i} Q_{i}^{D} \log \left( \epsilon +\frac{Q_{i}^{D}}{\epsilon +P_{i}}\right) , \end{aligned}$$
(20)

where P indicates the predicted saliency map, \(Q^{D}\) is the ground-truth saliency map, and \(\epsilon \) represents a regularity parameter.
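A short sketch of Eq. (20) follows, normalizing both maps to distributions first; the epsilon value and names are illustrative.

```python
# Sketch of Eq. (20): KL divergence between the ground-truth and predicted
# saliency distributions; eps and names are illustrative.
import numpy as np

def kl_divergence(pred, ground_truth, eps=1e-12):
    p = pred / (pred.sum() + eps)
    q = ground_truth / (ground_truth.sum() + eps)
    return np.sum(q * np.log(eps + q / (eps + p)))
```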

3.2.7 Other metrics

We also evaluated the performance of different saliency prediction models with two further metrics: precision-recall (PR) curves and the F-measure.Footnote 5 By binarizing the predicted saliency map at thresholds in [0, 255], a series of precision and recall pairs was calculated for each image in a dataset. The PR curve was then plotted from the average precision and recall over the dataset at the different thresholds [57].
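The sketch below shows the thresholding procedure described above. The exact F-measure definition is not given here (see Footnote 5); the weighting \(\beta^{2}=0.3\), commonly used in salient object detection, is an assumption of this sketch, as are the names.

```python
# Sketch of the PR-curve computation described above: binarize the prediction
# at thresholds in [0, 255] and accumulate precision/recall; the F-measure
# with beta^2 = 0.3 is a commonly used choice and is an assumption here.
import numpy as np

def pr_curve(pred_u8, gt_mask, beta2=0.3):
    """pred_u8: uint8 saliency map; gt_mask: binary ground-truth mask."""
    gt = gt_mask > 0
    precisions, recalls, f_scores = [], [], []
    for t in range(256):
        binary = pred_u8 >= t
        tp = (binary & gt).sum()
        precision = tp / (binary.sum() + 1e-12)
        recall = tp / (gt.sum() + 1e-12)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append((1 + beta2) * precision * recall /
                        (beta2 * precision + recall + 1e-12))
    return np.array(precisions), np.array(recalls), np.array(f_scores)
```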

Table 1 Quantitative scores of several models on the MIT1003 dataset. The baseline ITTI model is shown in light blue and the proposed model in green
Table 2 Quantitative scores of several models on the TORONTO dataset. The baseline ITTI model is shown in light blue and the proposed model in green
Table 3 Quantitative scores of several models on the SID4VAM dataset. The baseline ITTI model is shown in light blue and the proposed model in green

4 Experimental results

4.1 Quantitative comparison of the proposed model with other state-of-the-art models

To evaluate the performance of the proposed model, we compared it with eight other state-of-the-art models. For the quantitative comparison, we selected the MIT1003, TORONTO, and SID4VAM benchmarks. These results are reported in Tables 1, 2, and 3. Superior saliency prediction performance was achieved by models based on biological/cognitive and Fourier/spectral foundations. Our model achieved stable and superior performance across the different evaluation metrics compared to other biological/cognitive- and Fourier/spectral-inspired models. However, saliency prediction models based on convolutional neural networks outperformed other models on natural scene images, as large numbers of images were used to train these networks. Consequently, they cannot be directly compared to the other models, as they rest more on statistical than on neuroscientific principles. In this paper, we emphasize understanding saliency prediction from a neuroscience perspective, in order to further our understanding of the mechanism of visual attention. Furthermore, biological/cognitive- and Fourier/spectral-inspired saliency detection models outperformed the deep learning approaches (ML_Net and DeepGazeII) on the SID4VAM dataset (see Table 3 and Fig. 10). As previously stated, SID4VAM is a synthetic image database that is primarily used to psychophysically test V1 properties, which is also why we argue that deep learning models rely more on statistics than on neuroscience when explaining human visual attention mechanisms.

4.2 Qualitative comparison of the proposed model with other state-of-the-art models

We qualitatively tested the proposed model using the MIT1003, MIT300, TORONTO, SID4VAM, and UCF Sports datasets.Footnote 6 We also compared the model's performance with that of other state-of-the-art saliency prediction models on the MIT1003, TORONTO, and SID4VAM datasets. Figures 7, 8, 9, 10, and 12 show the saliency maps produced by the proposed model and other state-of-the-art models for sample images from the studied datasets. The performance of each saliency prediction model was evaluated through AUC and PR curves, as shown in Fig. 11. The proposed model predicted most of the salient objects in the given images. Furthermore, it successfully detected orientation, boundary, and pop-out effects when applied to the SID4VAM dataset. In summary, our biologically and Fourier/spectral-inspired saliency prediction model achieved superior and stable performance on natural images, psychophysical synthetic images, and dynamic scenes, compared with other existing models.

Fig. 7

Performance evaluation on the MIT1003 dataset. The first row shows color images, the second row shows ground-truth saliency maps, and the last row shows the proposed model's predicted saliency maps

Fig. 8

Left: Performance evaluation on the MIT300 dataset. The first and third columns are color images. The second and fourth columns are the proposed model’s predicted saliency maps. Right: Performance evaluation on the TORONTO dataset. The first and third columns are color images. The second and fourth columns are the proposed model’s predicted saliency maps

Fig. 9

Qualitative saliency prediction results on the MIT1003 dataset with different models. The first row shows six stimuli images selected from the MIT1003 dataset. The rows below show the predicted saliency maps obtained with Achanta, AIM, HFT, ICL, ITTI, SIM, and the proposed model, as well as the ground-truth (GT) saliency, with artificial color for better visualization

Fig. 10

Qualitative saliency prediction results on the SID4VAM dataset with different models. The first row shows six stimuli images selected from the SID4VAM dataset. The rows beneath show the saliency prediction results obtained with Achanta, AIM, HFT, ICL, ITTI, SIM, and the proposed model, as well as the ground-truth (GT) saliency, with artificial color for better visualization. The proposed model can be successfully applied to explain the "pop-out" effects in visual search

Fig. 11

ROC curve (AUC) and PR curves. Comparison of the area under the ROC curve (AUC) and PR curves, with different thresholds between our method and other state-of-the-art methods on three benchmark datasets

Fig. 12

Dynamic saliency prediction. For these sample frames from the UCF Sports Action dataset, the model clearly produced better results and perfectly captured the text information on the bottom-left

4.3 Ablation study

In this ablation study, we explore the efficacy of the main components, measuring model performance only with the AUC-Judd metric on the MIT1003 dataset. We first evaluate the contribution of the color appearance model, wavelet transform, wavelet energy map, and contrast sensitivity functions used for image feature extraction. The full method performs stably and consistently, whereas removing any module, for example, the opponent color channels, wavelet transform, wavelet energy map, or contrast sensitivity functions, degrades performance (see Table 4).

5 Discussion

Saliency modeling has become a well-known area of study in computer vision and neuroscience. Academics have employed many different architectures, yet these have not significantly increased model performance. Instead, we should investigate how humans perceive scenes; what draws their attention is crucial. In this study, we addressed several ways of going beyond the capabilities of such models.

First, it is vital to comprehend the cognitive attention mechanism. Visual attention is a selective cognitive process that helps us deal with the redundancy of natural scenes by focusing on key information while disregarding unnecessary information. Spatial attention is important in discrimination and appearance tasks in saliency prediction studies. Investigating the neural underpinnings of visual attention can help us design better and more accurate saliency prediction models; conversely, better saliency prediction models can help us comprehend the cognitive process of visual attention in the brain. Second, we require multi-modal and multi-label datasets to assess the effectiveness of saliency prediction models. The MIT saliency benchmark datasets were used to evaluate the proposed model. As previously stated, a saliency prediction model should perform well on both natural scene images and psychophysical synthetic images (e.g., SID4VAM); this can aid in improving model architectures and our understanding of the cognitive attention process. Third, we presented extensive experimental findings demonstrating that our method consistently achieved stable and better results compared to other state-of-the-art methods. It is worth noting that we used biologically inspired visual model estimation to determine saliency, and our proposed saliency prediction model incorporates more neuroscience notions than statistical concepts.

The proposed model incorporates opponent color channels, the wavelet transform, the wavelet energy map, and the contrast sensitivity function; however, it ignores the fact that, in the human visual system, natural images are preprocessed in multiple stages, from the retinal response to the LGN response, and modeling these stages could yield better saliency prediction accuracy and a closer approximation of the human visual system. In addition, other wavelet families [22] (e.g., Symlets, Morlet, Mexican hat, Meyer, and steerable pyramids) may be worth investigating in the future. Saliency prediction in dynamic videos may benefit from converting the spatio-chromatic contrast sensitivity functions into spatio-temporal ones, for instance, for the white-black, red-green, and yellow-blue channels. Most importantly, we must understand how to properly optimize model parameters, which is one of the most crucial steps in improving a model's ability to perform multiple tasks. The proposed model contains only a small number of parameters, which were essentially set and fixed for all tests. One of the main factors contributing to the method's efficiency is the use of wavelet energy maps and opponent CSFs as features. Furthermore, our extensive experimental results showed that the proposed saliency prediction measure, generated from a local image energy estimator, is far more effective and straightforward to implement than existing methods. Even though our method was built solely on biologically inspired computational principles, the resulting model showed significant agreement with the fixation behavior of the human visual system.

This study also has some limitations. First, the proposed model is inspired by the human low-level visual system, and each component of the model has already been used separately in various saliency models. The proposed architecture is based primarily on a simulated multi-channel coding principle, integrating and separating image features across components; additionally, in the final stage of the model, we applied spatio-chromatic CSFs to each channel, which is the first time CSFs have been applied to every opponent channel rather than only to the achromatic channel. Moreover, as stated before, traditional saliency prediction work includes both non-neural network and deep neural network-based models; all of these models focus only on natural image saliency prediction and completely ignore psychophysical synthetic images. As demonstrated in the results, deep neural networks outperform traditional (non-neural network) saliency prediction models on natural images, but they have limitations on psychophysical synthetic images, where, surprisingly, traditional (non-neural network) models outperform deep neural networks. Furthermore, deep neural networks become more unstable in the presence of distracting noise in natural images, while non-neural network saliency prediction remains more reliable [58]. Bowers et al. [59] identified further problems with using deep neural networks to model the human visual system; all of this suggests that deep neural networks must be tested on more broadly defined tasks to improve their generality.

Several extensions of this study can be examined in future work. First, noise or other degradations could be applied to natural or psychophysical synthetic images to fully examine how they affect saliency prediction results; this is important for improving the performance of saliency prediction models in complex real-world environments, for example, in fog or rain. Second, the saliency performance on both natural and psychophysical synthetic images could be checked with vision transformers, to see whether they behave similarly to deep convolutional networks. Third, training deep neural networks on psychophysical synthetic images would require a large synthetic image dataset; this would be fairer to deep networks, better reveal their performance on psychophysical synthetic images, and provide more evidence on the similarities and differences between deep networks and human vision.

Table 4 Ablation study of the proposed model: the role of each module in saliency prediction on the MIT1003 dataset

6 Conclusions

In this study, a computational psychophysical visual saliency prediction model inspired by the low-level human visual pathway was proposed. The model includes color opponent channels, a wavelet transform, a wavelet energy map, and contrast sensitivity functions to predict the saliency map. The model was evaluated on classical benchmark datasets and achieved consistently stable and superior visual saliency prediction performance compared with the baseline model. Furthermore, we found that models based on deep neural networks outperformed ours on natural image saliency prediction but underperformed on psychophysical synthetic images, whereas Fourier/spectral-inspired models showed the opposite pattern, as they simulate neural processing from the retina to V1. Deep neural networks rely on statistics more than on low-level visual system function, and we argue that they cannot reliably predict saliency on psychophysical synthetic images without specialized architectures and objectives. Lastly, we extended the model to spatial-temporal saliency prediction, and it was able to capture the most salient content in videos.