1 Introduction

The visual system directs attention to salient objects. A number of psychophysical experiments suggest that primary visual cortex (V1) may be involved in the computation of visual salience. Spratling introduced the nonlinear predictive coding/biased competition (PC/BC) model [1], a reformulation of predictive coding consistent with the biased competition theory of attention, which can simulate a very wide range of V1 response properties, including tuning and suppression [2, 3]. Reference [4] extends this previous work by showing that the PC/BC model of V1 can also simulate a wide range of psychophysical experiments on visual salience, and demonstrates that PC/BC provides a possible implementation of the V1 bottom-up saliency map hypothesis. It proposes that the perceptual saliency of an image corresponds to the relative strength of the prediction error calculated by PC/BC. Saliency can therefore be interpreted as a mechanism by which prediction errors attract attention in an attempt to improve the accuracy of the brain’s internal representation of the world [4].

Visual saliency plays an important role in natural vision: saliency can direct eye movements, deploy attention, and facilitate tasks such as object detection and scene understanding. Many models have been built to compute saliency maps. There are two major categories of factors that drive attention: bottom-up factors and top-down factors [5]. Bottom-up factors are derived solely from the visual scene: regions of interest attract attention in a bottom-up way when their features are sufficiently discriminative with respect to the surrounding features. Most computational models have focused on bottom-up attention, where the subjects free-view a scene and salient objects attract attention. Inspired by the feature-integration theory [6], Itti et al. [7] proposed one of the earliest bottom-up selective attention models, utilizing the color, intensity, and orientation of images. Bruce et al. [8] introduced the idea of using Shannon’s self-information to measure perceptual saliency. Saliency using natural image statistics (SUN) is a bottom-up Bayesian framework [9]. More recently, Hou et al. [10] proposed a dynamic visual attention approach that calculates the saliency map based on incremental coding length (ICL). Bottom-up attention can be biased toward targets of interest by top-down cues such as object features, scene context, and task demands; bottom-up and top-down factors should be combined to direct attentional behavior. A recent review of attention models from a computational perspective can be found in [11].

Reference [4] uses synthetic stimuli to test the saliency predictions of the PC/BC model. In this paper, inspired by the work of Spratling, we propose an approach to saliency detection in natural color images via the PC/BC model, with top-down cortical feedback serving as context. We compare our method against five state-of-the-art saliency detectors. Experimental results show that our method performs competitively on the visual saliency detection task. The rest of this paper is organized as follows. Section 2 introduces and analyzes Spratling’s PC/BC model and, building on this work, proposes a novel method for measuring image saliency that incorporates top-down cortical feedback. Experimental results and comparisons with state-of-the-art models are presented in Sect. 3, and discussions are given in Sect. 4.

2 Model Description

Figure 1 illustrates the retina/LGN model and the PC/BC model of V1. From left to right, the capital letters I, X, E, Y, and A represent the input image, the output of the retina/LGN preprocessing stage, the error-detecting neurons, the prediction neurons, and the feedback from higher cortical regions, respectively.

Fig. 1 The retina/LGN model and the PC/BC model of V1

2.1 The Retina/LGN Model

To simulate the effects of the circular-symmetric center-surround receptive fields (RFs) of the lateral geniculate nucleus (LGN) and retina, the input image (I) is preprocessed by convolution with a Laplacian-of-Gaussian (LoG) filter (l) followed by a saturating nonlinearity:

$$ X = \tanh \{ 2\pi (I * l)\} . $$
(1)

The positive and rectified negative responses were separated into two images, \( X_{ON} \) and \( X_{OFF} \), simulating the outputs of cells in the retina and LGN with on-center/off-surround and off-center/on-surround RFs, respectively. These ON- and OFF-channels provide the input to the PC/BC model of V1.
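As a concrete illustration, a minimal Python sketch of this preprocessing stage is given below, assuming a grayscale input with values in [0, 1]; the LoG scale sigma and the use of scipy’s gaussian_laplace are illustrative choices of this sketch, not specifications from Ref. [4].

```python
# A minimal sketch of the retina/LGN stage (Eq. 1), assuming a grayscale
# image I with values in [0, 1]. The LoG scale sigma is an illustrative
# choice; scipy's gaussian_laplace applies a Laplacian-of-Gaussian filter.
import numpy as np
from scipy.ndimage import gaussian_laplace

def retina_lgn(I, sigma=1.5):
    response = gaussian_laplace(I.astype(float), sigma=sigma)  # I * l
    X = np.tanh(2.0 * np.pi * response)                        # Eq. (1)
    X_on = np.maximum(X, 0.0)    # on-center/off-surround channel
    X_off = np.maximum(-X, 0.0)  # off-center/on-surround channel
    return {'ON': X_on, 'OFF': X_off}
```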

2.2 The V1 Model

The PC/BC model of V1 is described by the following equations:

$$ E_{o} = X_{o} \oslash \left( \varepsilon_{2} + \sum_{k=1}^{p} \hat{\omega}_{ok} * Y_{k} \right). $$
(2)
$$ Y_{k} \leftarrow (\varepsilon_{1} + Y_{k}) \otimes \sum_{o} \omega_{ok} \star E_{o}. $$
(3)
$$ Y_{k} \leftarrow Y_{k} \otimes (1 + \eta A_{k}). $$
(4)

where o ∈ {ON, OFF}; \( X_{o} \) represents the input to the model of V1, \( E_{o} \) represents the error-detecting neuron responses, \( Y_{k} \) represents the prediction neuron responses, and \( A_{k} \) represents the weighted sum of top-down predictions; all of these are two-dimensional arrays equal in size to the input image. \( \omega_{ok} \) is a two-dimensional kernel representing the synaptic weights for a particular class (k) of neuron, normalized so that the sum of all the weights is equal to \( \psi \); \( \hat{\omega}_{ok} \) is a two-dimensional kernel representing the same synaptic weights as \( \omega_{ok} \) but normalized so that the maximum value is equal to \( \psi \). The Gabor function is used to define the weights of each kernel \( \omega_{ok} \) and \( \hat{\omega}_{ok} \) (a family of 32 Gabor functions with eight orientations (0°–157.5° in steps of 22.5°) and four phases (0°, 90°, 180°, and 270°) was used); p is the total number of kernels; \( \varepsilon_{1} \), \( \varepsilon_{2} \), \( \eta \), and \( \psi \) are parameters; \( \oslash \) and \( \otimes \) indicate element-wise division and multiplication, respectively; \( \star \) represents cross-correlation (which is equivalent to convolution without the kernel being rotated 180°); and * represents convolution (which is equivalent to cross-correlation with a kernel rotated 180°). Parameter values \( \psi = 5000 \), \( \varepsilon_{1} = 0.0001 \), \( \varepsilon_{2} = 250 \), and \( \eta = 1 \) were used in the experiments.
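The sketch below illustrates one plausible construction of such a kernel bank; the kernel size, envelope width, spatial wavelength, and the exact handling of the normalization are assumptions of this sketch rather than values given in Ref. [4].

```python
# An illustrative construction of the 32-kernel Gabor bank (8 orientations
# x 4 phases). The kernel size, envelope width sigma, and wavelength are
# assumed values; each Gabor is split into non-negative ON/OFF parts and
# normalized two ways, one assumed reading of the normalization above.
import numpy as np

def gabor_bank(size=21, sigma=3.0, wavelength=8.0, psi=5000.0):
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    w, w_hat = [], []
    for theta in np.deg2rad(np.arange(0.0, 180.0, 22.5)):      # 8 orientations
        xr = xs * np.cos(theta) + ys * np.sin(theta)
        yr = -xs * np.sin(theta) + ys * np.cos(theta)
        envelope = np.exp(-(xr ** 2 + yr ** 2) / (2.0 * sigma ** 2))
        for phase in np.deg2rad([0.0, 90.0, 180.0, 270.0]):    # 4 phases
            g = envelope * np.cos(2.0 * np.pi * xr / wavelength + phase)
            on, off = np.maximum(g, 0.0), np.maximum(-g, 0.0)  # ON/OFF parts
            total = on.sum() + off.sum()                       # sum -> psi
            peak = max(on.max(), off.max())                    # max -> psi
            w.append((psi * on / total, psi * off / total))
            w_hat.append((psi * on / peak, psi * off / peak))
    return w, w_hat  # p = 32 (omega, omega-hat) kernel pairs
```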

Equation (2) describes the calculation of the neural activity for each population of error-detecting neurons. The activation of the error-detecting neurons can be interpreted as representing the residual error between the input and the reconstruction of the input generated by the prediction neurons. The values of E are related to the image saliency, with high error values corresponding to high saliency.

Equation (3) describes the updating of the prediction neuron activations. The values of Y k represent predictions of the causes underlying the inputs to the model of V1. If the input remains constant, the values of Y k will converge to steady-state values that reconstruct the input with minimum error.
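A minimal sketch of this bottom-up iteration, covering Eqs. (2) and (3), is given below, assuming the inputs from retina_lgn() and the kernel bank from gabor_bank() above; the iteration count and the readout of the saliency map as the maximum over the ON/OFF error channels are assumptions of this sketch.

```python
# A minimal sketch of the bottom-up PC/BC iteration, Eqs. (2) and (3),
# assuming X from retina_lgn() and (w, w_hat) from gabor_bank() above.
# The iteration count and the final saliency readout are assumptions.
import numpy as np
from scipy.signal import convolve2d, correlate2d

def pcbc(X, w, w_hat, eps1=1e-4, eps2=250.0, n_iter=20):
    p = len(w)
    Y = [np.zeros_like(X['ON']) for _ in range(p)]
    for _ in range(n_iter):
        E = {}
        for i, o in enumerate(('ON', 'OFF')):
            # Reconstruction of the input from the current predictions.
            recon = sum(convolve2d(Y[k], w_hat[k][i], mode='same')
                        for k in range(p))
            E[o] = X[o] / (eps2 + recon)                       # Eq. (2)
        for k in range(p):
            drive = sum(correlate2d(E[o], w[k][i], mode='same')
                        for i, o in enumerate(('ON', 'OFF')))
            Y[k] = (eps1 + Y[k]) * drive                       # Eq. (3)
    # One plausible saliency readout: high residual error = high saliency.
    saliency = np.maximum(E['ON'], E['OFF'])
    return E, Y, saliency
```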

2.3 Modeling the Top-Down Effects

Equation (4) describes the effects on the V1 prediction neuron activations of top-down inputs from prediction neurons at later processing stages (i.e., in extra-striate cortical regions). In Eq. (4), the effects of cortical feedback are modeled using an array of inputs (A) to the V1 model that represents the weighted sum of top-down predictions. In the simulations of Ref. [4], the feedback either encoded simple orientation preferences, with the elements of A set to values of 0.25 and zero, or was assumed to be negligible, with all elements of A set to zero, in which case Eq. (4) had no effect. We add the following equation between Eq. (3) and Eq. (4) to model the top-down effects:

$$ A_{k} \leftarrow \sum_{k=1}^{p} \hat{\omega}_{ok} * Y_{k}. $$
(5)

This top-down feedback has two effects on the PC/BC model of V1. (1) It increases the response of the prediction neurons that represent information consistent with the top-down expectation [see Eq. (4)]. These prediction neurons then send stronger feed-forward activation and hence make this information more conspicuous to cortical regions at subsequent stages of the processing hierarchy. (2) The enhanced activity of the prediction neurons consistent with top-down expectations in turn decreases the response of the error-detecting neurons from which these prediction neurons receive their input [see Eq. (2)] [4]. Since the strength of the responses of the error-detecting neurons is assumed to be related to saliency, top-down feedback thereby modulates bottom-up saliency.
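The sketch below illustrates how Eqs. (5) and (4) could be inserted into the pcbc() loop above, after the Eq. (3) update. Because Eq. (5) as printed yields an array shared across classes k, the sketch computes a single feedback array A; the restriction to the ON-channel kernels is an assumption of this sketch.

```python
# A sketch of the top-down step, Eqs. (5) and (4), as it could be inserted
# into the pcbc() loop above after the Eq. (3) update. Since Eq. (5) as
# printed yields one array shared across classes k, a single feedback
# array A is computed; using the ON-channel kernels is an assumption.
from scipy.signal import convolve2d

def topdown_step(Y, w_hat, eta=1.0):
    p = len(Y)
    A = sum(convolve2d(Y[k], w_hat[k][0], mode='same')         # Eq. (5)
            for k in range(p))
    for k in range(p):
        Y[k] = Y[k] * (1.0 + eta * A)                          # Eq. (4)
    return Y
```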

3 Experimental Comparisons

3.1 Saliency Results Comparison

We evaluated our method on human visual fixation data from natural images, using the benchmark dataset collected by Bruce and Tsotsos [8] for comparing methods’ predictions of human eye fixations. The dataset contains eye fixation data from 20 subjects for a total of 120 natural images.

Figure 2 affords a qualitative comparison of the output of the proposed models (without/with context) for a variety of images. Visually, the top-down effects improve the performance of salient object detection, i.e., top-down signals modulate bottom-up saliency, in line with the preceding analysis. Figure 2d shows the fixation density map derived from experimental human eye-tracking data, which serves as the “ground truth” saliency map for each image.

Fig. 2 Results for qualitative comparison: a Original image; b Saliency map without context; c Saliency map with context; d Fixation density map based on experimental human eye-tracking data

3.2 Comparing Our Saliency Results with Other Methods

We compare our saliency method with context against five other state-of-the-art methods on the publicly available database used by Achanta et al. [12]. Each of the 1,000 images in the database contains a salient or distinctive foreground object, allowing us to compare the performance of the different algorithms.

The five saliency detectors are those of Itti et al. [7], Harel et al. [13], Hou and Zhang [14], Achanta et al. [12], and Goferman et al. [15], hereafter referred to as IT, GB, SR, IG, and CA. We refer to our proposed method as PC. The choice of these algorithms is motivated by citation in the literature (the classic approach of IT is widely cited), recency (IG and CA are recent), and variety (IT is biologically motivated, CA is purely computational, GB is a hybrid approach, and SR and IG estimate saliency in the frequency domain).

We randomly chose some images from the database; Fig. 3 shows the output of the five state-of-the-art methods and of our method. The comparison suggests that our method is competitive and promising.

Fig. 3 Visual comparison of saliency maps. a Original, b IT [7], c GB [13], d SR [14], e IG [12], f CA [15], g PC

3.3 Quantitative Evaluation

To obtain a quantitative evaluation, we compare ROC curves and the area under the curve (AUC) on the database presented in [8]. Figure 4 shows the results of our method and of three other methods.

Fig. 4 ROC curves for the database of [8]
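For reference, a minimal sketch of such a pixel-wise ROC/AUC computation is shown below, assuming a predicted saliency map sal and a binary fixation map fix of the same size (both hypothetical arrays of this sketch); scikit-learn’s roc_curve and auc are used.

```python
# A minimal sketch of a pixel-wise ROC/AUC evaluation, assuming a predicted
# saliency map `sal` and a binary fixation map `fix` of the same size
# (hypothetical arrays); scikit-learn provides roc_curve and auc.
from sklearn.metrics import roc_curve, auc

def fixation_roc(sal, fix):
    # Each pixel is a sample: label 1 if fixated, 0 otherwise, scored by
    # its predicted saliency value.
    fpr, tpr, _ = roc_curve(fix.ravel().astype(int), sal.ravel())
    return fpr, tpr, auc(fpr, tpr)
```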

4 Discussions

PC/BC is a computational model of primary visual cortex (V1) that provides an implementation of the V1 bottom-up saliency map. In this paper, we propose a novel saliency detection method for natural color images that uses top-down cortical feedback as context. Our experimental results are consistent with a conclusion from the recent literature: top-down signals modulate (and can override) bottom-up saliency in a feature-specific way [16]. We compared our method with five state-of-the-art saliency detectors; the experimental results show that it performs competitively on the visual saliency detection task.

When the organism is not actively searching for a particular target (the free-viewing condition), its attention should be directed to the most salient points in the visual field, which are potential targets. Bottom-up attention mechanisms have been investigated more thoroughly than top-down mechanisms, partly because data-driven stimuli are easier to control than cognitive factors such as task demands, knowledge, and expectations. Even less is known about the interaction between the two processes [17].

In future work, we will incorporate color and other task-demand features as context for saliency detection, since “combining such feature-specific top-down signals with (learnt) contextual priors on target location therefore may provide a promising approach to searching for real-world objects in their natural context” [16], and we will develop applications of our model.