1 Introduction

Visual saliency reflects how much an image region or object stands out from its surroundings. Generally, it can be defined as what captures human perceptual attention. Salient object detection aims to build a saliency map for a natural scene and has a wide range of applications. For example, detected salient objects can be used to automatically zoom in on the “interesting” areas [5] or to automatically crop the “important” areas of an image [16]. Object recognition algorithms can use saliency detection results to quickly locate visually salient objects. Salient object detection can also reduce the interference of cluttered backgrounds and thereby improve the performance of image segmentation algorithms and image retrieval systems [19].

A number of computational models for salient object detection have been developed in recent years. A typical model is the one proposed by Itti et al. [10], which depends only on low-level image features to determine the salient objects in an image. This model is based on the Treisman theory [20] and the Koch neurobiological framework [12]. Following this model, many state-of-the-art saliency detection methods focus on low-level features. Among these models, image color and contrast have been utilized as the most important and effective features for detecting salient objects [14, 17].

For example, the computational models proposed in [3, 13] and [6] built saliency maps using image segmentation algorithms. These models divided the image into blocks according to the homogeneity of contrast and color features. However, the segmentation preprocessing is time-consuming. Duan et al. [7] performed uniform sampling on the RGB color feature maps without pre-running a complex segmentation algorithm. They used principal component analysis (PCA) to extract the effective features of the image blocks, and computed image saliency from the global contrast between the image blocks together with their spatial locations. Borji et al. [4] put forward a prediction model that reflects saliency discrimination with respect to eye-tracking data. This model measured the rarity of each block in both the RGB and LAB color spaces, and then combined the local and global saliency of each color space to generate the saliency map. Achanta et al. [1, 2] used luminance and color features to detect salient objects. They calculated the contrast between a local image region and its surroundings, and obtained the saliency map from the average color vector differences. Vu et al. [21] exploited image contrast in both the spectral and the spatial domain to measure local perceived sharpness. The resulting sharpness maps generated by this model represent the saliency of the input images well.

As stated above, many state-of-the-art salient object detection models compute image saliency primarily by measuring the contrast and color features of the image. However, these models do not fully consider the inter-relationship between the salient objects and the background from a global perspective. When the salient object and the background are similar in color and contrast, these models may fail to reflect image saliency. To address this problem, this paper proposes to fuse contrast-based saliency with the global color distribution, because the global color distribution can compensate for the loss of saliency information caused by this similarity in color and contrast. PCA is then used to fuse the color saliency map and the contrast saliency map so as to retain the most significant information about the object.

The remainder of this paper is organized as follows. Section 2 presents the proposed hierarchical salient object detection model. Experimental results obtained on the MSRA database are presented in Section 3. Conclusions are given in Section 4.

2 Proposed hierarchical salient object detection model

The visual properties of an object tend to vary, which inevitably leads to diversity in its image features. In order to describe an object fully and accurately, a variety of image features should be chosen according to the visual properties of all aspects of the object. By simulating the process of human attention, it can be found that image color and contrast play the key roles. Therefore, we extract these two critical features to detect image saliency.

2.1 Contrast based saliency

The conventional contrast measure [21] only considers the local characteristics of the image, while the global characteristics are neglected. Aiming to address this problem, we propose to extract image contrast by exploiting both local and global characteristics of the image region.

The proposed approach converts each image block to the spectral domain. The Euclidean distance between the mean amplitude value m(b) of each image block b and the mean spectral value m(I) of the whole image I is used to obtain a new spectrum value \(F_{1}(b)\):

$$F_{1}(b) = \left\{\begin{array}{ll} \vert m(b)-m(I)\vert & \qquad \text{if~} \bigtriangleup(b) \leq T1 \text{~or~} \mu(b) \leq T2\\ 0 & \qquad \text{else} \end{array} \right. $$

where

$$ \bigtriangleup(b)=max(L(b))-min(L(b)) $$
(1)
$$ \mu(b)=mean(L(b)) $$
(2)

where △(b) represents the difference between the maximum and minimum luminance values, μ(b) represents the average luminance value, and \(L(b)=(h+kb)^{\eta}\) denotes the luminance-valued block, with h=0.7656, k=0.0364, and η=2.2 corresponding to the Adobe RGB display conditions. For the thresholds \(T_{1}\) and \(T_{2}\), we empirically set \(T_{1}=5\) and \(T_{2}=2\), assuming that image pixel values lie between 0 and 255. By this means the low-contrast image blocks are set to zero. The saliency values \(F_{1}(b)\) of all image blocks are then combined to form the contrast saliency map.
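To make the block-level computation concrete, the following Python sketch evaluates Eqs. (1)–(2) and the piecewise definition of \(F_{1}(b)\) exactly as written above. It assumes 8-bit grayscale blocks and uses NumPy's FFT to obtain the mean amplitude values m(b) and m(I); the function and variable names are illustrative, not taken from the original implementation.

```python
import numpy as np

def mean_amplitude(patch):
    """Mean Fourier amplitude m(.) of a grayscale patch."""
    return np.abs(np.fft.fft2(patch)).mean()

def block_luminance(block, h=0.7656, k=0.0364, eta=2.2):
    """Luminance model L(b) = (h + k*b)^eta; `block` holds graylevels in [0, 255]."""
    return (h + k * block) ** eta

def contrast_value(block, image, T1=5.0, T2=2.0):
    """F1(b): |m(b) - m(I)| when Delta(b) <= T1 or mu(b) <= T2, else 0
    (the threshold test is reproduced exactly as in the piecewise definition)."""
    L = block_luminance(block)
    delta = L.max() - L.min()   # Eq. (1)
    mu = L.mean()               # Eq. (2)
    if delta <= T1 or mu <= T2:
        return abs(mean_amplitude(block) - mean_amplitude(image))
    return 0.0
```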

The resulting contrast saliency map can represent the image saliency well. However, it may face difficulties when the contrast difference between the salient object and the background is quite small. In order to further extract image saliency, we perform the Fourier transform so that both the saliency and the non-saliency information are stored in statistical form. The Fourier spectrum (denoted as F(u,v)) of the image can be decomposed into an amplitude spectrum and a phase spectrum, which can be expressed as:

$$ F(u,v)=\vert A(x,y)\vert e^{-jP(x,y)} $$
(3)

where A(x,y) and P(x,y) represent the amplitude spectrum and phase spectrum respectively, and can be given by:

$$ A(x,y)=\sqrt{R(x,y)^{2}+I(x,y)^{2}} $$
(4)
$$ P(x,y)=arctan\left(\frac{I(x,y)}{R(x,y)}\right) $$
(5)

where R(x,y) is the real component of the Fourier spectrum and I(x,y) is the imaginary component.

As can be seen from the above formulas, the image signal can be characterized as a combination of amplitudes and phased sine waves. The phase spectrum contains the texture information of the original image and preserves the important characteristics of the signal, while the amplitude spectrum contains the light–dark (chiaroscuro) information of the original image.

The amplitude spectrum represents the magnitude and relative proportion of the sinusoidal components, while the phase spectrum represents the locations of these sinusoidal components, which carry the important parts of the saliency information. As a result, the phase spectrum should remain intact when reconstructing the Fourier spectrum, so as to largely preserve the integrity of the signal.

We then analyze the amplitude spectrum of the image. The image can be viewed as the accumulation of object projections on a homogeneous background. After the Fourier transform, this projection information is decomposed into a weighted sum of complex fundamental waves. The fundamental waves corresponding to common characteristics (non-saliency features) of the original image usually account for a large proportion, while those corresponding to novel characteristics (saliency features) account for only a small proportion. Thus, we can inhibit the common characteristics and enhance the novel characteristics of the original image by adjusting the amplitude spectrum.

The amplitude spectrum is the weighted sum of fundamental waves of various features. The larger amplitudes, which represent non-saliency features, should be inhibited, while the smaller amplitudes, which represent saliency features, should be enhanced. Accordingly, we adjust the amplitude spectrum so as to inhibit the non-saliency information while retaining the saliency information to the maximum extent. Therefore, we propose to adjust the amplitude spectrum A(x,y) via:

$$ A(x,y)=\vert A(x,y) - mean(A(x,y))\vert $$
(6)

For an image block B of scale r, the proposed method is further applied in a multi-scale strategy with three scales, \(\mathrm{R}=\{\mathrm{r},\frac{1}{2}\mathrm{r},\frac{1}{4}\mathrm{r}\}\), to improve the contrast between the salient and non-salient regions. The multi-scale processing locates the salient object more accurately because it suppresses features that occur frequently while retaining features that deviate from the norm. Performing this multi-scale processing in the spectral domain adds little computation but contributes significantly to the accuracy of salient object detection. The contrast maps obtained by our method and by [21] are shown in Fig. 1.
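A minimal sketch of the spectral adjustment described above is given below, assuming a grayscale image in [0, 1]. It keeps the phase spectrum intact, replaces the amplitude spectrum by |A − mean(A)| as in Eq. (6), and repeats the operation at the three scales; the use of skimage for resizing and the simple averaging of the three scale maps are assumptions, since the paper does not specify how the scales are combined.

```python
import numpy as np
from skimage.transform import resize

def spectral_contrast_map(gray, scales=(1.0, 0.5, 0.25)):
    """Phase-preserving amplitude adjustment (Eq. (6)) applied at three scales."""
    maps = []
    for s in scales:
        img = gray if s == 1.0 else resize(gray, (int(gray.shape[0] * s),
                                                  int(gray.shape[1] * s)),
                                           anti_aliasing=True)
        F = np.fft.fft2(img)
        A, P = np.abs(F), np.angle(F)              # amplitude and phase, Eqs. (4)-(5)
        A_adj = np.abs(A - A.mean())               # Eq. (6): suppress common components
        rec = np.abs(np.fft.ifft2(A_adj * np.exp(1j * P)))  # rebuild with original phase
        maps.append(resize(rec, gray.shape))       # back to the original resolution
    return np.mean(maps, axis=0)                   # combine scales (simple average)
```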

Fig. 1 a1–a4 Input images, b1–b4 contrast saliency maps obtained by the proposed method, c1–c4 contrast saliency maps obtained by [21]

2.2 Color spatial distribution

The salient region is the area with strong contrast or variation. Accordingly, the color values in the salient region are relatively far from the average color value of the whole image, while the background region is the smooth area whose color values are close to the average. Therefore, the salient object can be extracted by calculating the Euclidean distance between the average color value of each image block and that of the whole image, where a larger difference indicates higher saliency.

The LAB color space provides a representation of color that corresponds to how humans perceive chromatic differences. Thus, we convert the image to the perceptually uniform LAB color space and divide the image (denoted as I) into blocks (denoted as b) with 50% overlap. The color feature \(F_{2}(b)\) of each image block can be computed via the following equation, and the values of all image blocks are then combined to form the color saliency map:

$$ F_{2}(b)=\sqrt{\left(\overline A(b)-\overline A(I)\right)^{2}+\left(\overline B(b)-\overline B(I)\right)^{2}} $$
(7)

where \(\overline {A}\) and \(\overline {B}\) represent the average A and B color components in the LAB color space, respectively.
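The color cue can be sketched as follows, using skimage for the RGB-to-LAB conversion. The block size of 16 pixels and the way overlapping block responses are accumulated into a pixel map are assumptions; only the per-block distance of Eq. (7) comes from the text.

```python
import numpy as np
from skimage.color import rgb2lab

def color_saliency(rgb, block=16):
    """Color spatial distribution cue: distance of each block's mean (A, B)
    chroma from the global mean, Eq. (7), with 50% block overlap."""
    lab = rgb2lab(rgb)                              # expects RGB (uint8 or float in [0, 1])
    A, B = lab[..., 1], lab[..., 2]
    A_bar, B_bar = A.mean(), B.mean()               # global averages over the image
    H, W = A.shape
    step = block // 2                               # 50% overlap
    sal = np.zeros((H, W))
    for y in range(0, H - block + 1, step):
        for x in range(0, W - block + 1, step):
            a = A[y:y + block, x:x + block].mean()
            b = B[y:y + block, x:x + block].mean()
            f2 = np.sqrt((a - A_bar) ** 2 + (b - B_bar) ** 2)   # Eq. (7)
            sal[y:y + block, x:x + block] = np.maximum(
                sal[y:y + block, x:x + block], f2)   # keep the strongest response
    return sal
```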

2.3 Center prior

When humans look at a picture, they naturally gaze at objects near its center [11], because photographers usually center the object of interest. Human fixation therefore unconsciously starts at the central part of the image. Thus, in order to obtain salient objects that conform to this property of human vision, more weight needs to be assigned to the central image region. Here we use a feature (denoted as \(f_{c}(b)\)) to indicate the distance between each image block and the image center. For each image block, the contrast and color saliency values \(F_{n}(b)\), n=1,2, are recalculated via:

$$ F_{n}^{\ast}(b)=F_{n}(b)f_{c}(b) $$
(8)
$$ f_{c}(b)=1-\frac{\sqrt{(r(b)-C^{\ast})^{2}+{(c(b)-R^{\ast})}^{2}}}{\sqrt{(M/2)^{2}+(N/2)^{2}}} $$
(9)

where r(b) and c(b) represent the upper-left coordinates of image block b, \(C^{\ast}\) and \(R^{\ast}\) represent the center coordinates of the image, and M and N denote the width and height of the image, respectively. The color and contrast saliency maps after applying the center prior are shown in Fig. 2.

Finally, the feature map \(F_{n}(I)\) of the whole image I is generated by combining the image features \(F_{n}^{\ast}(b)\) of all the blocks. The normalized feature map can then be calculated by:

$$ \overline{F}_{n}(I)=\frac{F_{n}(I)-\min(F_{n}(I))}{\max(F_{n}(I))-\min(F_{n}(I))} $$
(10)
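The center prior and the normalization can be sketched as follows. The weight of Eq. (9) is evaluated here per pixel rather than per block, which is a simplifying assumption; Eq. (8) then reduces to an element-wise product, and Eq. (10) is the usual min-max normalization.

```python
import numpy as np

def center_weight(shape):
    """Center-prior weight f_c for every position of an M x N map, Eq. (9)."""
    M, N = shape
    rows, cols = np.mgrid[0:M, 0:N]
    dist = np.sqrt((rows - M / 2.0) ** 2 + (cols - N / 2.0) ** 2)
    return 1.0 - dist / np.sqrt((M / 2.0) ** 2 + (N / 2.0) ** 2)

def normalize(feature_map):
    """Min-max normalization, Eq. (10)."""
    return (feature_map - feature_map.min()) / (feature_map.max() - feature_map.min() + 1e-12)

# Eq. (8): weight a feature map by the center prior, then normalize, e.g.
#   sal_color = normalize(color_saliency(img) * center_weight(img.shape[:2]))
```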

The extracted feature maps are shown in Fig. 2.

2.4 PCA based feature fusion

PCA can remove redundant information and can effectively combine the most important information of two different components. We therefore use PCA to fuse the color feature map and the contrast feature map. The proposed hierarchical salient object detection model is shown in Fig. 3.

Fig. 2 Resulting feature maps. a Input image, b color saliency map, c contrast saliency map, d center weight map, e color saliency map after center prior, f contrast saliency map after center prior

Fig. 3 The hierarchical salient object detection model

First, we arrange the contrast feature map \(F_{1}(I)\) and the color feature map \(F_{2}(I)\) into data matrices (denoted as \(M_{n}\), n=1,2), respectively. Next, we calculate the covariance matrix (denoted as C) of the two data matrices \(M_{1}\) and \(M_{2}\). Then, we compute the eigenvalues (denoted as \(\lambda_{1}\) and \(\lambda_{2}\)) and the corresponding eigenvectors (denoted as \(\xi_{i}\) and \(\zeta_{i}\), i=1,2) of the covariance matrix C, and determine the weighting coefficients (denoted as \(\omega_{i}\), i=1,2) via:

$$\omega_{i} = \left\{\begin{array}{ll} \xi_{i} / {\sum}_{i=1}^{2} \xi_{i} & \qquad \text{if~} \lambda_{1} > \lambda_{2}\\ \zeta_{i} / {\sum}_{i=1}^{2} \zeta_{i} & \qquad \text{else} \end{array} \right. $$

Finally, the resulting saliency map (denoted as F(I)) can be calculated by:

$$ F(I)=\sum\limits_{i=1}^{2} \omega_{i} M_{i} $$
(11)

In order to achieve a better visual effect, the proposed fusion is performed at three scales (100%, 50%, and 25%), which better suppresses the background information. Finally, the generated fusion map is smoothed with a Gaussian filter (template size 10×10, σ=2.5), and the smoothed map serves as the final saliency map.
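A minimal sketch of the PCA-based fusion and the final smoothing is given below. It follows the standard two-band PCA fusion scheme (weights taken from the dominant eigenvector of the 2×2 covariance matrix of the two feature maps) and approximates the 10×10 Gaussian template with scipy's gaussian_filter; the exact matrix layout and the single-scale fusion shown here are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pca_fuse(contrast_map, color_map, sigma=2.5):
    """Fuse two normalized feature maps with PCA-derived weights (Eq. (11))
    and smooth the result with a Gaussian filter (sigma = 2.5)."""
    X = np.stack([contrast_map.ravel(), color_map.ravel()], axis=0)  # 2 x (M*N) data matrix
    C = np.cov(X)                                    # 2 x 2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # eigenvalues in ascending order
    v = np.abs(eigvecs[:, np.argmax(eigvals)])       # dominant eigenvector
    w = v / v.sum()                                  # weighting coefficients sum to one
    fused = w[0] * contrast_map + w[1] * color_map   # Eq. (11)
    return gaussian_filter(fused, sigma=sigma, truncate=2.0)  # ~11x11 kernel, close to 10x10
```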

3 Experimental results

The experiments were conducted on the MSRA database [13], which includes two parts: (i) image set A, containing 20,000 images in which the principal salient objects were labeled by three users, and (ii) image set B, containing 5,000 images in which the principal salient objects were labeled by nine users. The images used in the experiments are representative of a range of image types. Some of them show fairly typical conditions, such as objects located at different image positions. Others are more intricate and contain complex scenes, such as low-contrast objects, cluttered backgrounds, and low-lighting conditions.

The proposed model (FCC) was compared with seven other state-of-the-art models, including Itti’s model (IT) [10], the salient region detection and segmentation (AC) model [1], the graph-based visual saliency (GB) model [8], the saliency using natural statistics (SUN) model [22], the non-parametric (NP) model [15], the image signature (IS) model [9], and the context-aware (CA) model [18]. The performance of these salient object detection models is shown in Fig. 4.

Fig. 4 Saliency maps obtained from different saliency computational models. a Testing images, b ground-truth labeled rectangles, c–i saliency maps obtained by the seven state-of-the-art models, respectively, j saliency maps obtained by the proposed method

As illustrated in Fig. 4, the AC [1] and SUN [22] models fail to detect the salient object in complex backgrounds. The saliency maps generated by the IT [10] and NP [15] models appear rather blurry, making it difficult to clearly identify the salient object. The GB model [8] cannot prominently reflect the outline of the salient objects and has difficulty with textured images. The IS model [9] only produces low-resolution results and can hardly distinguish the saliency information from the non-saliency information. The CA model [18] performs fairly well; however, the highlighted salient region contains too much background information. The saliency maps obtained by our proposed model have a uniform salient region close to the labeled rectangle, and the model achieves good performance on complex backgrounds.

Given the generated saliency map S(x,y), we set a threshold T to segment the salient objects:

$$F(x,y) = \left\{\begin{array}{ll} 1 & \qquad S(x,y) > T\\ 0 & \qquad S(x,y) \leq T \end{array} \right. $$
$$ T=\alpha \ast \frac{1}{M \ast N} \sum\limits_{x=0}^{M-1} \sum\limits_{y=0}^{N-1} S(x,y) $$
(12)

where N and M are the height and width of the saliency map, respectively. The parameter α is empirically set to 1.7 to achieve a higher detection rate.
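For reference, a one-function sketch of this adaptive threshold is shown below (the input saliency map and α=1.7 follow the text; the function name is illustrative).

```python
import numpy as np

def segment_salient(saliency, alpha=1.7):
    """Binary segmentation with the adaptive threshold of Eq. (12):
    T = alpha * mean(S)."""
    T = alpha * saliency.mean()
    return (saliency > T).astype(np.uint8)
```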

Objective performance evaluation is conducted by calculating the True Positive Rate (TPR) and the False Positive Rate (FPR). For the obtained saliency map S(x,y), a threshold t (0≤t≤1) is used to obtain binary masks \(B_{t}(x,y)\), in which 0 denotes the background and 1 denotes the salient objects. With G(x,y) denoting the binary ground-truth mask, the TPR and FPR can be computed via:

$$ TPR=E\left(B_{t}(x,y) \ast G(x,y)\right) $$
(13)
$$ FPR=E\left((1-B_{t}(x,y)) \ast G(x,y)\right) $$
(14)
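A pixel-wise evaluation following Eqs. (13)–(14) can be sketched as below, assuming the saliency map is normalized to [0, 1] and G(x,y) is a binary ground-truth mask; sweeping t over [0, 1] yields the ROC curve of Fig. 5.

```python
import numpy as np

def tpr_fpr(saliency, gt, t):
    """TPR and FPR at threshold t, computed as pixel-wise expectations
    over the binary mask B_t and the ground truth G (Eqs. (13)-(14))."""
    B = (saliency > t).astype(float)      # binary mask B_t(x, y)
    tpr = float(np.mean(B * gt))          # Eq. (13)
    fpr = float(np.mean((1.0 - B) * gt))  # Eq. (14)
    return tpr, fpr
```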

Table 1 shows the TPR and FPR results obtained by these models. As illustrated in Table 1 and Fig. 5, the proposed model outperforms the other seven models.

Fig. 5 ROC curves of salient object detection results obtained by different salient object detection models

Table 1 The TPR and FPR results obtained by different salient object detection approaches

Finally, we compare the computational complexity of the salient object detection models discussed above. For this purpose, the models were implemented in Matlab and run on a PC with a G2020 CPU and 4 GB of RAM. Each model was applied to the MSRA database, and the average execution times (in seconds) are reported in Table 2.

Table 2 The run-time performance (in seconds) comparison of different salient object detection models

4 Conclusions

This paper has presented a hierarchical salient object detection model that uses PCA to fuse the color and contrast features of the original image. The generated saliency map can highlight the salient objects in different images. Experiments were conducted on the MSRA dataset to compare the proposed model with seven other state-of-the-art salient object detection models. As verified by extensive experiments, the proposed model achieves fairly good performance against these state-of-the-art saliency computational models.