1 Introduction

Images have become a major source of information. Humans routinely and effortlessly judge the importance of image regions and focus attention on the important parts. This ability, known as visual attention, plays a significant role in analyzing and extracting important visual information in engineering applications, and enabling computers to capture salient information has become a hot research topic. A profound challenge in computer vision is to make the computer understand a scene from an image; three vital tasks in this field are perceiving the key objects, detecting shapes and contours, and capturing context information. To achieve this goal, obtaining visual saliency information from the image is the most basic and important step. Correctly extracting salient regions can improve the efficiency and accuracy of image analysis and processing, reduce computational complexity, allow preferential allocation of computational resources, bridge the gap between content understanding and low-level features, and make a higher-level understanding of the image possible. Saliency maps are widely used in many computer vision and pattern recognition applications, such as image retrieval [7, 27], image segmentation [5, 19], object detection [6], image compression [21] and image fusion [3]. A thorough study of saliency detection therefore has far-reaching significance for improving the performance of image understanding and analysis systems, as well as for raising the level of image processing applications.

The human visual system scans a scene both with a rapid, bottom-up, task-independent, data-driven form of saliency extraction and with a slower, top-down, task-dependent, goal-driven one. So far, most research estimates visual saliency in a bottom-up way because of its task-independent character. Over the past decades, many excellent methods have been proposed to detect visual saliency. Based on the highly influential biologically inspired early representation model introduced by Koch and Ullman [12], Itti et al. [9] proposed one of the earliest works in visual saliency detection. They applied center-surround differences across a multi-scale, multi-feature visual space to obtain a saliency map, located salient regions in the saliency map in decreasing order of saliency value, and represented each salient region by a circle of fixed radius. Ma and Zhang [15] incorporated a fuzzy growing technique into their saliency method to detect different levels of saliency. Zhai and Shah [24] proposed a visual saliency method based on single-channel histogram contrast. Using the Fourier spectral residual, Hou and Zhang [8] detected salient image regions in the spectral domain. Within a Bayesian framework, Zhang et al. [25] analyzed the statistics of natural images to detect image saliency. Drawing on information theory, Bruce et al. [4] calculated image saliency by information maximization. Achanta et al. [1] proposed a purely computational method and, in order to accommodate large salient object regions, later improved it by adjusting the spatial frequency content of the image [2]. Starting from mathematical and statistical principles, Murray et al. [18] processed the image with an inverse wavelet transform and extracted the saliency map using a nonlinear weighted fusion of scales. Xie et al. [23] proposed a bottom-up saliency method within the Bayesian framework that exploits low- and mid-level cues. These methods can quickly find the object regions of human interest. However, most of them focus either on human eye-fixation prediction or on a particular salient object detection task and therefore generalize poorly: a method that performs well on eye-fixation prediction may give unsatisfying salient object detection results, and vice versa. Furthermore, by defining visual saliency at each location as the dissimilarity between that location and its local neighborhood or global counterparts, many state-of-the-art methods have to segment the image into blocks or regions before detecting saliency, and the quality of this segmentation directly affects both the extracted saliency and the efficiency of the method.

We propose an effective visual saliency detection method based on multi-scale and multi-channel means to improve the definition and accuracy of saliency detection. A 2-D wavelet transform is used to decompose and reconstruct the image, which effectively filters the background information and highlights salient regions without image segmentation. The bicubic interpolation algorithm is then used to narrow the filtered image at multiple scales. We take the distances between the narrowed images and the means of their channels as saliency values, which avoids producing higher saliency values near edges instead of uniformly highlighting salient objects. To filter out background noise in the saliency maps, we retain only the values that are not less than the mean saliency of the given map. The bicubic interpolation algorithm is used again to amplify the maps back to a common scale, and the combined saliency map is computed by adding the amplified maps. Finally, linear normalization is employed to obtain the final saliency map. We provide an objective comparison of our saliency maps against 9 state-of-the-art methods; our method outperforms all of them in terms of definition and accuracy.

2 Relevant theories

2.1 Color space conversion

Because the three components of the RGB color space are highly correlated and have no direct relationship to intuitive color concepts such as hue, saturation and brightness, it is better not to process these components directly. The HSV color model matches human color perception, converts easily to and from the RGB model, reflects the human perception of color well, and is therefore convenient for image processing [22, 26]. Since the HSV color model is widely used in image saliency detection, we detect image saliency in HSV color space. The following formulas transform RGB color space to HSV color space:

$$ H=\left\{\begin{array}{c}\hfill \theta, G\ge B\hfill \\ {}\hfill 2\pi -\theta, G<B\hfill \end{array}\right. $$
(1)
$$ S=1-\frac{3}{\left(R+G+B\right)} min\left(R,G,B\right) $$
(2)
$$ V=\frac{1}{\sqrt{3}}\left(R+G+B\right) $$
(3)

Where the hue H denotes the basic pure color, the saturation S the proportion of white light mixed into the color, and V the brightness, with R, G, B ∈ [0, 1] and \( \theta =co{s}^{-1}\left(\frac{\left(R-G\right)+\left(R-B\right)}{2\sqrt{{\left(R-G\right)}^2+\left(R-B\right)\left(G-B\right)}}\right) \).
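For concreteness, a minimal NumPy sketch of formulas (1)–(3) follows; the function name, the small ε guard against a zero denominator and the clipping of the arccos argument are our own additions, not part of the original formulation.

```python
import numpy as np

def rgb_to_hsv_model(rgb):
    """Convert an RGB image (channels in [0, 1]) to the HSV model of
    formulas (1)-(3). eps guards the degenerate case R = G = B, where
    theta is undefined (our assumption)."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-12
    num = (R - G) + (R - B)
    den = 2.0 * np.sqrt((R - G) ** 2 + (R - B) * (G - B)) + eps
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))
    H = np.where(G >= B, theta, 2.0 * np.pi - theta)                     # formula (1)
    S = 1.0 - 3.0 * np.minimum(np.minimum(R, G), B) / (R + G + B + eps)  # formula (2)
    V = (R + G + B) / np.sqrt(3.0)                                       # formula (3)
    return np.stack([H, S, V], axis=-1)
```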

2.2 2-D wavelet transform

Multi-scale analysis was introduced into wavelet analysis by Mallat in 1989. He proposed the concept of multi-resolution analysis and gave a general method for constructing orthogonal wavelet bases together with a fast wavelet algorithm, known as the Mallat algorithm, which plays a role analogous to that of the Fast Fourier Transform [16, 17]. The wavelet transform matches the human visual system well. The low-frequency coefficients of wavelet blocks correspond to the average luminance of image blocks: large coefficients represent regions of high average luminance and small coefficients regions of low average luminance. The high-frequency coefficients represent the texture and edge portions of the image: coefficients with large absolute values correspond to complex texture and edges, and coefficients with small absolute values to smooth parts of the image. Therefore, we separate the low and high frequencies of the image by 2-D wavelet decomposition, highlight the low-frequency components and attenuate the high-frequency components, and then apply the 2-D wavelet transform again for image reconstruction; after this, the salient regions are enhanced and the background noise is filtered. We consider \( \{c^{j+1}_{k,m}\} \) as a 2-D image, where j is the resolution level and k and m are the row and column indices respectively. The decomposition and reconstruction algorithms of the 2-D wavelet transform are then as follows:

$$ \left\{\begin{aligned} c^{j}_{k,m} &= \sum_{l,n}\tilde{h}_{l-2k}\tilde{h}_{n-2m}c^{j+1}_{l,n}\\ d^{j,1}_{k,m} &= \sum_{l,n}\tilde{h}_{l-2k}\tilde{g}_{n-2m}c^{j+1}_{l,n}\\ d^{j,2}_{k,m} &= \sum_{l,n}\tilde{g}_{l-2k}\tilde{h}_{n-2m}c^{j+1}_{l,n}\\ d^{j,3}_{k,m} &= \sum_{l,n}\tilde{g}_{l-2k}\tilde{g}_{n-2m}c^{j+1}_{l,n} \end{aligned}\right. $$
(4)
$$ c^{j+1}_{k,m} = \sum_{l,n}h_{k-2l}h_{m-2n}c^{j}_{l,n} + \sum_{l,n}h_{k-2l}g_{m-2n}d^{j,1}_{l,n} + \sum_{l,n}g_{k-2l}h_{m-2n}d^{j,2}_{l,n} + \sum_{l,n}g_{k-2l}g_{m-2n}d^{j,3}_{l,n} $$
(5)

Where \( \tilde{h}, \tilde{g}, h \) and \( g \) are biorthogonal filter banks, and \( c^{j}_{k,m} \), \( d^{j,1}_{k,m} \), \( d^{j,2}_{k,m} \) and \( d^{j,3}_{k,m} \) are the low-frequency coefficient and the horizontal, vertical and diagonal high-frequency coefficients respectively. The filter banks of the 2-D wavelet transform are illustrated in Fig. 1.

Fig. 1
figure 1

Filter banks of 2-D wavelet transform (a) 2-D wavelet decomposition (b) 2-D wavelet reconstruction
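PyWavelets provides ready-made implementations of formulas (4) and (5); the sketch below is our own illustration of a single decomposition/reconstruction level, not the authors' code, and it uses the Haar filter bank adopted later in Section 3.1.

```python
import pywt

def wavelet_level(channel):
    """One level of 2-D wavelet decomposition (formula (4)) followed by
    reconstruction (formula (5))."""
    # approximation c^j and the three detail sub-bands d^{j,1..3}
    cA, details = pywt.dwt2(channel, 'haar')
    # formula (5): restored is (numerically) equal to the input channel
    restored = pywt.idwt2((cA, details), 'haar')
    return cA, details, restored
```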

2.3 Bicubic interpolation

Multi-scale analysis is conventional and useful for visual saliency detection, and it is widely used in the literature [9, 13, 14]. For the same object, attention on a small-scale image focuses on the object as a whole, while attention on a large-scale image cares more about local details. A small-scale image makes an integral object appear more continuous and conspicuous. We use bicubic interpolation [11] to narrow and amplify images at different scales; it retains more image detail and yields relatively clear image quality when scaling.

Bicubic interpolation builds a simple continuous analytical model from the known pixel observations. It is cubic interpolation in two dimensions: the target pixel value f(i + m, j + n) is obtained as the weighted mean of the 4 × 4 neighbourhood of the floating-point coordinate (i + m, j + n), where i and j are the integer column and row of the pixel, m and n are the fractional parts of the coordinate, and the neighbouring samples lie at integer offsets between -1 and 2. The interpolation basis function is the foundation of bicubic interpolation; the basis function u(s) is defined as follows:

$$ u(s)=\left\{\begin{array}{c}\hfill \frac{3}{2}\left|s\right|{}^3-\frac{5}{2}\left|s\right|{}^2+1\kern0.5em ,\kern0.5em \left|s\right|<1\hfill \\ {}\hfill -\frac{1}{2}\left|s\right|{}^3+\frac{5}{2}\left|s\right|{}^2-4\left|s\right|+2\kern0.5em ,\kern0.5em 1\le \left|s\right|<2\hfill \\ {}\hfill 0\kern0.5em ,\kern0.5em 2\le \left|s\right|.\hfill \end{array}\right. $$
(6)

As illustrated in Fig. 2, u(s) approximates the ideal interpolation curve sin(s * π)/(s * π).

Fig. 2
figure 2

Cubic interpolation curve and its approximation curve

f(i + m, j + n) can be obtained by the following interpolation formula:

$$ f(i+m,j+n) = \begin{bmatrix} u(m+1)\\ u(m)\\ u(m-1)\\ u(m-2) \end{bmatrix}^{T} \begin{bmatrix} f(i-1,j-1) & f(i-1,j) & f(i-1,j+1) & f(i-1,j+2)\\ f(i,j-1) & f(i,j) & f(i,j+1) & f(i,j+2)\\ f(i+1,j-1) & f(i+1,j) & f(i+1,j+1) & f(i+1,j+2)\\ f(i+2,j-1) & f(i+2,j) & f(i+2,j+1) & f(i+2,j+2) \end{bmatrix} \begin{bmatrix} u(n+1)\\ u(n)\\ u(n-1)\\ u(n-2) \end{bmatrix} $$
(7)
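A direct, unoptimized transcription of formulas (6) and (7) is sketched below; the clipping of neighbourhood indices at the image border is our own assumption.

```python
import numpy as np

def u(s):
    """Bicubic basis function of formula (6)."""
    s = abs(s)
    if s < 1:
        return 1.5 * s ** 3 - 2.5 * s ** 2 + 1.0
    if s < 2:
        return -0.5 * s ** 3 + 2.5 * s ** 2 - 4.0 * s + 2.0
    return 0.0

def bicubic_value(f, i, j, m, n):
    """Formula (7): weighted mean of the 4x4 neighbourhood of the
    floating-point coordinate (i + m, j + n)."""
    rows, cols = f.shape
    wr = np.array([u(m + 1), u(m), u(m - 1), u(m - 2)])   # row (i) weights
    wc = np.array([u(n + 1), u(n), u(n - 1), u(n - 2)])   # column (j) weights
    patch = np.empty((4, 4))
    for a, di in enumerate((-1, 0, 1, 2)):
        for b, dj in enumerate((-1, 0, 1, 2)):
            patch[a, b] = f[min(max(i + di, 0), rows - 1),
                            min(max(j + dj, 0), cols - 1)]
    return wr @ patch @ wc
```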

3 Visual saliency detection based on multi-scale and multi-channel mean

Many saliency methods have been proposed, but they usually center on human eye-fixation prediction or on a specific salient object detection task, and most require image segmentation. This often results in poor extensibility and degrades both the extracted saliency and the efficiency of the algorithm. To address these problems, we propose an effective visual saliency detection method based on multi-scale and multi-channel means. The method comprises four parts: image filtering, image resizing, saliency value calculation and saliency map generation. The framework of the proposed method is shown in Fig. 3.

Fig. 3
figure 3

Framework of the visual saliency detection based on multi-scale and multi-channel mean

3.1 Image filtering

The physiological characteristics of human vision make it more sensitive to the low-frequency signal of an image than to the high-frequency signal. The low-frequency coefficients describe the main energy of the image and the high-frequency coefficients describe its details. We therefore highlight the low-frequency components and attenuate the high-frequency components with a 2-D wavelet transform in HSV color space. For efficiency and simplicity, we use a three-level 2-D wavelet transform for image decomposition and reconstruction. The tower structure of the three-level 2-D wavelet transform is illustrated in Fig. 4.

Fig. 4
figure 4

Tower structure of three-level 2-D wavelet transform

The Haar wavelet, proposed by Alfred Haar in 1909, is the simplest wavelet. In addition, it is the only orthogonal wavelet that is both symmetric and compactly supported [10, 20]. We therefore construct our filter banks from the Haar wavelet, whose filter function is:

$$ {h}_k=\left\{\begin{array}{c}\hfill \frac{1}{\sqrt{2}}\kern1em if\ k=0,1\hfill \\ {}\hfill 0\kern1em otherwise\hfill \end{array}\right. $$
(8)

Where h_k is a real filter. Combined with Section 2.2, we obtain \( g_k = (-1)^k h_{1-k} \), \( \tilde{h}_k = h_k \) and \( \tilde{g}_k = g_k \). The high-frequency and low-frequency signal images obtained after 2-D Haar wavelet decomposition and reconstruction are shown in Fig. 5(c) and (d) respectively. The filtered image is obtained by

Fig. 5
figure 5

a Original image (b) The original image in HSV color space (c) High frequency signal image (d) Low frequency signal image (e) Filtered image

$$ I=\alpha {I}_H+\beta {I}_L $$
(9)

Where I_H and I_L are the high-frequency and low-frequency signal images respectively, and α and β are weight parameters, set to 0.9 and 1.1 respectively in our experiments. The filtered image I is the last image in Fig. 5.
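A possible implementation of this filtering step is sketched below, assuming the PyWavelets wavedec2/waverec2 API; reconstructing I_L and I_H by zeroing the complementary coefficient groups is our reading of the description above, not necessarily the authors' exact procedure.

```python
import numpy as np
import pywt

def filter_channel(channel, alpha=0.9, beta=1.1, levels=3):
    """Three-level 2-D Haar decomposition; I_L keeps only the approximation,
    I_H keeps only the detail coefficients, and the two reconstructions are
    recombined as in formula (9)."""
    coeffs = pywt.wavedec2(channel, 'haar', level=levels)
    low = [coeffs[0]] + [tuple(np.zeros_like(d) for d in det) for det in coeffs[1:]]
    high = [np.zeros_like(coeffs[0])] + list(coeffs[1:])
    I_L = pywt.waverec2(low, 'haar')
    I_H = pywt.waverec2(high, 'haar')
    # For odd-sized inputs the reconstruction can be one pixel larger;
    # cropping is omitted in this sketch.
    return alpha * I_H + beta * I_L          # formula (9)
```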

3.2 Resize image

Salient regions are associated with image scale. For the same object, attention on a small-scale image focuses on the object as a whole, which raises the saliency of small-scale regions, while attention on a large-scale image raises the saliency of large-scale regions. We therefore resize images with bicubic interpolation: the filtered image is narrowed to obtain images at different scales, and the resulting saliency maps are later amplified back to a unified scale.

Wavelet decomposition produces a series of sub-images of different resolutions, each sub-image being 1/4 the size of the image at the previous level; after a three-level 2-D wavelet transform we obtain sub-images of 1/4, 1/16 and 1/64 the size of the filtered image. After many experiments, we found that better results are obtained on four scales of the filtered image I. We consider four scale factors β_1 = 1, β_2 = 1/4, β_3 = 1/16 and β_4 = 1/64, and the same operations are applied at each of the four scales. The filtered image I is narrowed by the bicubic interpolation algorithm at the four scales, and the narrowed results are illustrated in Fig. 6.

Fig. 6
figure 6

Resized images. (a) 1 size of the filtered image (b) 1/4 size of the filtered image (c) 1/16 size of the filtered image (d) 1/64 size of the filtered image
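A sketch of the narrowing step using OpenCV's bicubic resize is given below; interpreting β as an area factor (so the per-axis factor is √β, since each wavelet level halves both image dimensions) is our own assumption.

```python
import cv2
import numpy as np

def narrow_scales(filtered, scales=(1.0, 1.0 / 4, 1.0 / 16, 1.0 / 64)):
    """Narrow the filtered image with bicubic interpolation at the four
    scales beta_1..beta_4."""
    filtered = filtered.astype(np.float32)
    narrowed = []
    for beta in scales:
        s = float(np.sqrt(beta))   # per-axis factor for an area factor beta
        narrowed.append(cv2.resize(filtered, None, fx=s, fy=s,
                                   interpolation=cv2.INTER_CUBIC))
    return narrowed
```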

3.3 Calculate saliency values

To prevent the saliency computation from producing higher saliency values near edges instead of uniformly highlighting salient objects, and from ignoring spatial relationships across image parts, we compute the pixel mean of each channel in HSV color space and take the distance between each pixel and the mean of its channel as the saliency value. The saliency map SV_n of an image is formulated as:

$$ S{V}_n={\left({H}_n-H\_Mea{n}_n\right)}^2+{\left({S}_n-S\_Mea{n}_n\right)}^2+{\left({V}_n-V\_Mea{n}_n\right)}^2 $$
(10)

Where H_n, S_n and V_n are the H, S and V channels of the filtered image at scale n in HSV color space, and H_Mean_n, S_Mean_n and V_Mean_n are the means of the respective channels. The multi-scale saliency maps are illustrated in Fig. 7.

Fig. 7
figure 7

Saliency maps in multi-scales. a 1 size of the saliency map (b) 1/4 size of the saliency map (c) 1/16 size of the saliency map (d) 1/64 size of the saliency map
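Formula (10) reduces to a few array operations; the sketch below assumes the resized filtered image is supplied as a stacked H-S-V array.

```python
def saliency_values(hsv):
    """Formula (10): squared distance of every pixel from the mean of its
    channel in HSV space."""
    H, S, V = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    return ((H - H.mean()) ** 2 +
            (S - S.mean()) ** 2 +
            (V - V.mean()) ** 2)
```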

From Fig. 7 we can clearly see that the multi-scale saliency maps contain considerable background noise. To filter this background noise, highlight salient regions and avoid poor image definition, we retain only the values that are not less than the mean saliency of the given map. Following Sections 2.3 and 3.2, the saliency maps are then amplified with the bicubic interpolation algorithm to the size of the original image. The multi-scale saliency maps and their corresponding amplified images are illustrated in Fig. 8.

Fig. 8
figure 8

Saliency maps in multi-scales and their corresponding amplified images. a 1 size of the saliency map (b) 1/4 size of the saliency map (c) 1/16 size of the saliency map (d) 1/64 size of the saliency map (e) Amplified image corresponding to 1 size of the saliency map (f) Amplified image corresponding to 1/4 size of the saliency map (g) Amplified image corresponding to 1/16 size of the saliency map (h) Amplified image corresponding to 1/64 size of the saliency map
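A sketch of the noise-filtering and amplification step follows; setting suppressed pixels to zero is our reading of "retain only values not less than the mean", and the bicubic upscaling again uses OpenCV.

```python
import cv2
import numpy as np

def suppress_and_amplify(sv, out_hw):
    """Keep only saliency values not less than the mean of the map, then
    amplify the map to the original image size (out_hw = (height, width))
    with bicubic interpolation."""
    sv = np.where(sv >= sv.mean(), sv, 0.0).astype(np.float32)
    return cv2.resize(sv, (out_hw[1], out_hw[0]),
                      interpolation=cv2.INTER_CUBIC)
```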

3.4 Generate saliency map

We obtain the combined saliency map CS by a simple linear addition of the 4 amplified saliency maps, and normalize CS linearly to obtain the saliency map SM_1, defined as

$$ S{M}_1=255*\frac{CS-CS(min)}{CS(max)-CS(min)} $$
(11)

Where CS(max) and CS(min) are the maximum and the minimum values of pixels in CS.
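The combination and normalization of formula (11) can be sketched as follows; the small ε protecting against a constant map is our addition.

```python
import numpy as np

def combine_and_normalize(amplified_maps):
    """Add the 4 amplified saliency maps and normalize linearly to
    [0, 255] as in formula (11)."""
    CS = np.sum(amplified_maps, axis=0)
    return 255.0 * (CS - CS.min()) / (CS.max() - CS.min() + 1e-12)
```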

Using the evaluation protocol of Section 4.1, different values of δ yield different AUC scores under the detection framework in Fig. 3, where δ (δ ∈ (0, 1]) is the scale multiple of the original image. As shown in Fig. 9, higher AUC scores are obtained when δ = 0.6 and δ = 1.

Fig. 9
figure 9

AUC scores for δ values

Considering the close correlation between a saliency map and its multi-scale counterparts, we take both the 0.6-scale version of the original image and the original image itself into account, and generate saliency maps under the detection framework in Fig. 3, denoted SM_0.6 and SM_1 and shown in Fig. 10(a) and (b) respectively. The final integrated saliency map SM (Fig. 10(c)) is obtained by

Fig. 10
figure 10

Generated saliency maps. a Generated saliency map corresponding to 0.6 sizes of the original image (b) Generated saliency map corresponding to 1 size of the original image (c) Final integrated saliency map

$$ SM=\gamma S{M}_{0.6}+\left(1-\gamma \right)S{M}_1 $$
(12)

Where γ is a weight parameter, γ ∈ [0, 1]. AUC scores for γ values are illustrated in Fig. 11. We set γ to be 0.45 in our experiments.

Fig. 11
figure 11

AUC scores for γ values
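The fusion of formula (12) is a single weighted sum; a minimal sketch with the γ = 0.45 used in our experiments:

```python
def fuse_saliency_maps(sm_06, sm_1, gamma=0.45):
    """Formula (12): weighted fusion of the maps generated from the
    0.6-scale image and the original-scale image."""
    return gamma * sm_06 + (1.0 - gamma) * sm_1
```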

4 Experiments

We present an empirical evaluation and analysis of the proposed method against 9 state-of-the-art methods on the MSRA salient object database with labeled ground truths [1], which contains 1000 color images with accurate pixel-wise object-contour segmentations. All programs were run under Windows 7 on an AMD Athlon(tm) X2 Dual-Core QL-64 2.1 GHz processor using MATLAB R2011b.

4.1 Evaluation standards

To compare the quality of different saliency maps, we use the widely adopted receiver operating characteristic curve (ROC curve) [4, 18, 25]. In addition, average values of Precision, Recall and F-Measure [1, 2, 23] are computed to measure the performance of the different saliency methods.

Given a saliency map and its labeled ground truth, the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) can be calculated as follows:

$$ \left\{\begin{aligned} TP &= \sum_{i}^{N_{is}} f\left(SM_i,t\right)BM_i\\ FP &= \sum_{i}^{N_{is}} f\left(SM_i,t\right)\left(1-BM_i\right)\\ TN &= \sum_{i}^{N_{is}} f\left(t,SM_i\right)\left(1-BM_i\right)\\ FN &= \sum_{i}^{N_{is}} f\left(t,SM_i\right)BM_i \end{aligned}\right. $$
(13)

Where N_{is} is the total number of pixels in SM, SM_i is the i-th pixel of SM, t ∈ [0, 255] is the binarization threshold, BM_i is the corresponding pixel of the binary ground-truth mask, and the function f(n_1, n_2) is defined as

$$ f\left({n}_1,{n}_2\right)=\left\{\begin{array}{c}\hfill 1,{n}_1\ge {n}_2\hfill \\ {}\hfill 0,{n}_1<{n}_2\hfill \end{array}\right. $$
(14)
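A direct transcription of formulas (13) and (14) in NumPy is sketched below (our own illustration, not the evaluation code used in the experiments).

```python
import numpy as np

def confusion_counts(sm, bm, t):
    """Formulas (13)-(14): pixel-wise counts for a saliency map sm with
    values in [0, 255] against a binary ground-truth mask bm at threshold t."""
    pos = (sm >= t).astype(np.float64)   # f(SM_i, t)
    neg = (t >= sm).astype(np.float64)   # f(t, SM_i)
    TP = np.sum(pos * bm)
    FP = np.sum(pos * (1.0 - bm))
    TN = np.sum(neg * (1.0 - bm))
    FN = np.sum(neg * bm)
    return TP, FP, TN, FN
```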

Correspondingly, the true positive rate (TPR) and the false positive rate (FPR) are calculated as,

$$ \left\{\begin{array}{c}\hfill TPR=\frac{TP}{TP+FN}\hfill \\ {}\hfill FPR=\frac{FP}{FP+TN}\hfill \end{array}\right. $$
(15)

By varying the threshold t from 0 to 255, the ROC curve of a saliency model is plotted as mean FPR versus mean TPR. The ROC curve is a composite indicator that reflects the sensitivity (TPR) and 1-specificity (FPR) of continuous variables, and it is the most prevalent criterion for evaluating the performance of visual saliency detection methods. The area under the ROC curve (AUC) serves as a quantitative statistic of the experimental results: the higher the AUC, the higher the accuracy of the method.
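Building on the confusion_counts sketch above, the ROC curve and AUC for a single image can be estimated as follows; the trapezoidal AUC estimate is our own choice.

```python
import numpy as np

def roc_auc(sm, bm):
    """Sweep t from 0 to 255, collect (FPR, TPR) pairs (formula (15)) and
    estimate the AUC with the trapezoidal rule."""
    tprs, fprs = [], []
    for t in range(256):
        TP, FP, TN, FN = confusion_counts(sm, bm, t)
        tprs.append(TP / (TP + FN + 1e-12))
        fprs.append(FP / (FP + TN + 1e-12))
    order = np.argsort(fprs)
    return np.trapz(np.array(tprs)[order], np.array(fprs)[order])
```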

Analogously, the Precision and the Recall are defined as,

$$ \left\{\begin{array}{c}\hfill Precision=\frac{TP}{TP+FP}\hfill \\ {}\hfill Recall=\frac{TP}{TP+FN}\hfill \end{array}\right. $$
(16)

We vary the threshold from 0 to 255 on a given saliency map with saliency values in the range [0, 255], calculate Precision and Recall at each threshold, and then compute their average values. The F-Measure is obtained from these averages over the same labeled ground truths:

$$ {F}_{\beta }=\frac{\left(1+{\beta}^2\right)\times P\times R}{\beta^2\times P+R} $$
(17)

Where P is the average Precision and R the average Recall. We use β² = 0.3 to weight Precision more than Recall, as suggested in [1, 2, 23]. The higher the Precision, Recall and F-Measure, the better the performance of the method.
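Again using the confusion_counts sketch, the averaged Precision and Recall and the F-Measure of formula (17) can be computed as follows.

```python
import numpy as np

def precision_recall_fmeasure(sm, bm, beta2=0.3):
    """Average Precision and Recall over thresholds 0..255 (formula (16))
    and the F-Measure of formula (17) with beta^2 = 0.3."""
    precisions, recalls = [], []
    for t in range(256):
        TP, FP, TN, FN = confusion_counts(sm, bm, t)
        precisions.append(TP / (TP + FP + 1e-12))
        recalls.append(TP / (TP + FN + 1e-12))
    P, R = float(np.mean(precisions)), float(np.mean(recalls))
    F = (1.0 + beta2) * P * R / (beta2 * P + R + 1e-12)
    return P, R, F
```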

4.2 Experimental results and analyses

We evaluate the proposed method against several state-of-the-art methods: IK [9], MZ [15], ZS [24], HZ [8], ZT [25], BT [4], AS [2], MV [18] and XL [23]. Sample results, where brighter pixels indicate higher saliency probability, are illustrated in Fig. 12. The IK [9] method generates low-resolution maps, tends to highlight boundaries and assign relatively low probabilities to pixels inside objects, and extracts only small parts of salient objects. The MZ [15], HZ [8] and ZT [25] methods care more about local abrupt changes, so they can only capture the edges of objects. The ZS [24], AS [2] and MV [18] methods attend to the whole regions of salient objects, but they either miss large parts of the objects or produce unreasonable or diffuse maps. The BT [4] and XL [23] methods are able to locate whole salient objects, but their results include many background details. Our method considers both global and local saliency, preserves the edges of salient objects, filters out most background details and yields higher image definition. Overall, the saliency maps of our method are much closer to the labeled ground truths.

Fig. 12
figure 12

Visual comparison of our saliency maps with 9 state-of-the-art methods. a Original images (b) IK [9] (c) MZ [15] (d) ZS [24] (e) HZ [8] (f) ZT [25] (g) BT [4] (h) AS [2] (i) MV [18] (j) XL [23] (k) Proposed (l) Labeled ground truths

The ROC curves of the various saliency methods and the corresponding AUC bars are shown in Figs. 13 and 14 respectively. The maximum sensitivities, maximum 1-specificities and corresponding AUC scores are given in Table 1. As shown in Figs. 13, 14 and Table 1, our method achieves the clearest relationship between the sensitivity and 1-specificity of the saliency maps and the highest AUC score of 0.8630, with both the maximum 1-specificity and the maximum sensitivity reaching 1. The proposed method performs better than the other 9 state-of-the-art methods, which indicates that our saliency maps locate the salient regions more precisely and effectively. Although the sensitivities of the AS [2] and XL [23] methods at low 1-specificities are higher than that of the proposed method, their maximum 1-specificities are only 0.8472 and 0.7844 respectively, clearly lower than that of the proposed method. The maximum sensitivity and 1-specificity of the ZT [25] and MV [18] methods approach those of the proposed method, but the curvature of their ROC curves is clearly lower than that of the proposed method. For these reasons, their AUC scores are lower than the AUC score of the proposed method.

Fig. 13
figure 13

ROC curves

Fig. 14
figure 14

AUC bars

Table 1 Comparison of the proposed method with 9 state-of-the-art methods

The Precision, Recall and F-Measure of each method are averaged over the 1000 images, and the results are shown in Fig. 15 and Table 1. AS [2] shows high Precision but very poor Recall, indicating that it is better suited to gaze tracking than to salient region segmentation. XL [23] shows high Recall but low Precision. Among all the methods, the proposed method achieves one of the best Precision values with a higher Recall, and the best F-Measure. Overall, the proposed method not only enhances the definition of salient regions, but also improves the accuracy of visual saliency detection.

Fig. 15
figure 15

Mean Precision, Recall and F-Measure values of the evaluation standards

5 Summary

We have proposed an effective visual saliency detection method based on multi-scale and multi-channel means. The method is not tailored to human eye-fixation prediction or to a particular salient object detection task, and it does not require image segmentation. We analyze frequency signals and color channels and detect saliency at multiple scales. On the MSRA image database and under several evaluation standards, we demonstrate that the proposed method outperforms 9 state-of-the-art saliency methods. However, the accuracy of saliency detection for complex textured backgrounds is not very high. In future work it may be beneficial to incorporate high-level factors such as symmetry and semantics into the saliency maps, and to seek more effective physiological, psychological and computer vision models for saliency detection.