Abstract
This paper proposes an effective method for visual saliency detection based on the multi-scale and multi-channel mean. A 2-D wavelet transform is used to decompose and reconstruct the image. Bicubic interpolation is applied to downscale the filtered image to multiple scales. We take the distances between the downscaled images and the means of their channels as saliency values, and we retain only the values that are not less than the mean saliency of the given image. Bicubic interpolation is applied again to upscale the multi-scale maps, and the saliency map is calculated by adding the upscaled maps. Finally, linear normalization is employed to obtain the final saliency map. Experimental results show that the proposed method outperforms 9 state-of-the-art methods in both the definition and the accuracy of saliency detection.
1 Introduction
Images have become a major source of information. Humans routinely and effortlessly judge the importance of image regions and focus attention on the important parts. This ability, known as visual attention, plays a significant role in analyzing and extracting important visual information in engineering applications, and giving computers the ability to capture salient information is an active research topic. A profound challenge in computer vision is to make the computer understand the surrounding scene from images; three vital tasks in this field are perceiving the key object, detecting shapes and contours, and capturing context information. To achieve this goal, obtaining visual saliency information from the image is the most basic and important step. Correctly extracting salient regions can improve the efficiency and accuracy of image analysis and processing, reduce computational complexity, allow preferential allocation of computational resources, bridge the gap between content understanding and low-level features, and make a higher-level understanding of the image possible. Saliency maps are widely used in computer vision and pattern recognition applications such as image retrieval [7, 27], image segmentation [5, 19], object detection [6], image compression [21] and image fusion [3]. The in-depth study of saliency detection is therefore significant for improving the performance of image understanding and image analysis systems, as well as for advancing the application of image processing technologies.
The human visual system scans a scene both through rapid, bottom-up, task-independent, data-driven saliency extraction and through slower, top-down, task-dependent, goal-driven saliency extraction. So far, most research estimates visual saliency in a bottom-up way because of its task independence. Over the past decades, many excellent methods have been proposed to detect visual saliency. Based on the highly influential biologically inspired early representation model introduced by Koch and Ullman [12], Itti et al. [9] proposed one of the earliest works in visual saliency detection. They applied center-surround differences across a multi-scale, multi-feature visual space to obtain a saliency map, located the salient regions in order of decreasing saliency value, and represented each salient region by a circle of fixed radius. Ma and Zhang [15] incorporated a fuzzy growing technique into the saliency method to detect different levels of saliency. Zhai and Shah [24] proposed a visual saliency method based on single-channel histogram contrast. Using the Fourier spectral residual, Hou and Zhang [8] detected salient regions from the image spectrum. Within a Bayesian framework, Zhang et al. [25] analyzed the statistics of natural images to detect image saliency. Drawing on information theory, Bruce et al. [4] calculated image saliency by information maximization. Achanta et al. [1] proposed a purely computational method; to handle large salient object regions, they later proposed an improved method that adjusts the spatial frequency content of the image [2]. Based on mathematical and statistical principles, Murray et al. [18] processed the image with an inverse wavelet transform and then extracted the saliency map by nonlinear weighted scale fusion. Xie et al. [23] proposed a novel method for bottom-up saliency within the Bayesian framework by exploiting low- and mid-level cues.
These methods can quickly find the object regions of human interest. However, most of them focus either on human eye fixation prediction or on a specific salient object detection task, and do not generalize well: a method that performs well on eye fixation prediction may give unsatisfying salient object detection results, and vice versa. Furthermore, because they define visual saliency at each location as the dissimilarity between that location and its local neighborhood or global counterparts, many state-of-the-art methods have to segment the image into blocks or regions. The quality of the image segmentation then directly affects saliency extraction and the efficiency of the method.
We propose an effective visual saliency detection method based on the multi-scale and multi-channel mean to improve the definition and accuracy of saliency detection. A 2-D wavelet transform is used to decompose and reconstruct the image, which effectively filters out background information and highlights salient regions without image segmentation. Bicubic interpolation is used to downscale the filtered image to multiple scales. We take the distances between the downscaled images and the means of their channels as saliency values, which avoids producing higher saliency values near edges instead of uniformly highlighting salient objects. To filter the background noise out of the saliency maps, we retain only the values that are not less than the mean saliency of a given image. Bicubic interpolation is used again to upscale the multi-scale maps, and the saliency map is calculated by adding the upscaled maps. Finally, linear normalization is employed to obtain the final saliency map. We provide an objective comparison of the saliency maps against 9 state-of-the-art methods. Our method outperforms all of these methods in terms of definition and accuracy.
2 Relevant theories
2.1 Color space conversion
Because the three components of the RGB color space are highly correlated and have no direct relation to intuitive color concepts such as hue, saturation and brightness, it is better not to process these components directly. The HSV color space is a color model matched to how the human eye distinguishes colors: it has good perceptual characteristics, converts easily to and from the RGB color space, reflects human color perception well, and is convenient for image processing [22, 26]. Since the HSV color space is widely used in image saliency detection, we detect image saliency in the HSV color space. The following formulas transform the RGB color space to the HSV color space:
Where the hue H is the basic pure color, the saturation S is the ratio of white light mixed into the color, the brightness value V is the ratio of black light mixed into the color, \( R, G, B \in [0, 1] \) and \( \theta = \cos^{-1}\left(\frac{\left(R-G\right)+\left(R-B\right)}{2\sqrt{{\left(R-G\right)}^2+\left(R-B\right)\left(G-B\right)}}\right) \).
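A minimal sketch of such a conversion is given below for illustration. Only the θ expression comes from the text. The case split on B ≤ G for the hue, and the definitions S = (max − min)/max and V = max, are common textbook conventions assumed here; they are not confirmed as the authors' formulas. The function name `rgb_to_hsv_sketch` is hypothetical.

```python
import numpy as np

def rgb_to_hsv_sketch(rgb):
    """Convert an (..., 3) float array with R, G, B in [0, 1] to (H, S, V).

    H is in radians and uses the theta expression quoted in the text.
    The B <= G case split and the S, V definitions are assumptions.
    """
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    num = (r - g) + (r - b)
    # The radicand is non-negative in exact arithmetic; clamp rounding noise.
    den = 2.0 * np.sqrt(np.maximum((r - g) ** 2 + (r - b) * (g - b), 0.0))
    # Gray pixels give den == 0 and an undefined hue; use theta = 0 there.
    safe_den = np.where(den > 0, den, 1.0)
    ratio = np.where(den > 0, num / safe_den, 1.0)
    theta = np.arccos(np.clip(ratio, -1.0, 1.0))
    h = np.where(b <= g, theta, 2.0 * np.pi - theta)   # assumed case split
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    s = np.where(mx > 0, (mx - mn) / np.where(mx > 0, mx, 1.0), 0.0)  # assumed
    v = mx                                             # assumed
    return np.stack([h, s, v], axis=-1)
```

With these conventions, pure red, green and blue map to hues of 0, 2π/3 and 4π/3 radians, and gray pixels get zero saturation.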
2.2 2-D wavelet transform
Multi-scale analysis was introduced into wavelet analysis by Mallat in 1989. He proposed the concept of multi-resolution analysis and gave a general method for constructing orthogonal wavelet bases together with a fast wavelet algorithm, known as the Mallat algorithm, whose role in wavelet analysis is comparable to that of the Fast Fourier Transform in Fourier analysis [16, 17]. The wavelet transform matches the human visual system well. The low frequency coefficients of wavelet blocks correspond to the average luminance of image blocks: large coefficients represent regions of high average luminance, and small coefficients represent regions of low average luminance. The high frequency coefficients represent the texture and edges of the image: coefficients with large absolute values correspond to complex texture and edges, and coefficients with small absolute values correspond to smooth parts of the image. Therefore, we separate the low and high frequencies of the image by 2-D wavelet decomposition, highlight the low frequency components and attenuate the high frequency components, and then use the 2-D wavelet transform again for image reconstruction. After these steps, the salient region is enhanced and the background noise is filtered. We consider \( \{c^{j+1}_{k,m}\} \) as a 2-D image, where j is the resolution level, and k and m are the row index and column index respectively. The decomposition and reconstruction algorithms for the 2-D wavelet transform are as follows:
Where \( \tilde{h},\tilde{g},h\ \mathrm{and}\ g \) are biorthogonal filter banks, and \( c^{j}_{k,m} \), \( d^{j,1}_{k,m} \), \( d^{j,2}_{k,m} \) and \( d^{j,3}_{k,m} \) are the low frequency coefficients, horizontal high frequency coefficients, vertical high frequency coefficients and diagonal high frequency coefficients respectively. The filter banks of the 2-D wavelet transform are illustrated in Fig. 1.
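To make the decomposition and reconstruction step concrete, here is a minimal one-level sketch using the orthonormal Haar filter pair (low-pass taps 1/√2, 1/√2 and high-pass taps 1/√2, −1/√2). Choosing Haar here is an assumption for illustration. The function names are hypothetical, and the three detail subbands are returned as `d1`, `d2`, `d3` without claiming which one the paper calls horizontal or vertical.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_decompose_2d(c):
    """One level of 2-D Haar analysis on an array with even height and width."""
    c = np.asarray(c, dtype=float)
    # Filter and downsample along axis 0, then along axis 1.
    lo = (c[0::2, :] + c[1::2, :]) / SQRT2
    hi = (c[0::2, :] - c[1::2, :]) / SQRT2
    approx = (lo[:, 0::2] + lo[:, 1::2]) / SQRT2
    d1 = (lo[:, 0::2] - lo[:, 1::2]) / SQRT2
    d2 = (hi[:, 0::2] + hi[:, 1::2]) / SQRT2
    d3 = (hi[:, 0::2] - hi[:, 1::2]) / SQRT2
    return approx, d1, d2, d3

def haar_reconstruct_2d(approx, d1, d2, d3):
    """Invert haar_decompose_2d: upsample and filter along axis 1, then axis 0."""
    rows, cols = approx.shape
    lo = np.empty((rows, 2 * cols))
    hi = np.empty((rows, 2 * cols))
    lo[:, 0::2] = (approx + d1) / SQRT2
    lo[:, 1::2] = (approx - d1) / SQRT2
    hi[:, 0::2] = (d2 + d3) / SQRT2
    hi[:, 1::2] = (d2 - d3) / SQRT2
    c = np.empty((2 * rows, 2 * cols))
    c[0::2, :] = (lo + hi) / SQRT2
    c[1::2, :] = (lo - hi) / SQRT2
    return c
```

Each subband has half the height and half the width of the input, and applying `haar_reconstruct_2d` to the four subbands returns the input exactly up to rounding.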
2.3 Bicubic interpolation
Multi-scale analysis is a conventional and useful tool for visual saliency detection and is widely used in the literature [9, 13, 14]. For the same object, attention on a small-scale image focuses on the whole object, while attention on a large-scale image focuses more on local details. A small-scale image makes a whole object more continuous and conspicuous. We use bicubic interpolation [11] to downscale and upscale images to different scales. Bicubic interpolation retains more image detail and gives relatively sharp results in image scaling.
Bicubic interpolation builds a simple, continuous analytical model from the known pixel samples. It is cubic interpolation in two dimensions: the target pixel value f(i + m, j + n) is obtained as the weighted mean of the 4 × 4 neighborhood of the floating-point coordinate (i + m, j + n), where i and j are the column and row of the pixel, m and n are the offsets of the target position from (i, j), and the neighborhood covers integer offsets from −1 to 2. The interpolation basis function is the foundation of bicubic interpolation. The basis function u(s) is defined as follows:
As illustrated in Fig. 2, u(s) is a cubic approximation of the curve sin(s * π)/(s * π).
f(i + m, j + n) can be obtained by the following interpolation formula:
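As an illustration, the sketch below uses the cubic convolution kernel of Keys [11] with a free parameter `a`. Which value of `a` the authors used is not stated here, so the default `a = -0.5` is an assumption; `a = -1` gives the other commonly quoted form, 1 − 2|s|² + |s|³ for |s| ≤ 1 and 4 − 8|s| + 5|s|² − |s|³ for 1 < |s| < 2. The function names and the border handling (replicating edge pixels) are also assumptions.

```python
import numpy as np

def cubic_kernel(s, a=-0.5):
    """Cubic convolution kernel u(s); zero for |s| >= 2."""
    s = np.abs(np.asarray(s, dtype=float))
    inner = (a + 2.0) * s**3 - (a + 3.0) * s**2 + 1.0          # |s| <= 1
    outer = a * s**3 - 5.0 * a * s**2 + 8.0 * a * s - 4.0 * a  # 1 < |s| < 2
    return np.where(s <= 1.0, inner, np.where(s < 2.0, outer, 0.0))

def bicubic_sample(img, y, x, a=-0.5):
    """Interpolate img at the real-valued position (y, x) from its 4 x 4 neighborhood."""
    img = np.asarray(img, dtype=float)
    i, j = int(np.floor(y)), int(np.floor(x))
    offsets = np.arange(-1, 3)                         # -1, 0, 1, 2
    rows = np.clip(i + offsets, 0, img.shape[0] - 1)   # replicate border pixels
    cols = np.clip(j + offsets, 0, img.shape[1] - 1)
    wy = cubic_kernel(y - (i + offsets), a)            # weights along rows
    wx = cubic_kernel(x - (j + offsets), a)            # weights along columns
    patch = img[np.ix_(rows, cols)]
    return float(wy @ patch @ wx)
```

The four weights along each axis sum to 1, so constant regions are preserved, and sampling at an integer position returns the pixel value itself.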
3 Visual saliency detection based on multi-scale and multi-channel mean
Many saliency methods have been proposed, but they usually center on human eye fixation prediction or on a specific salient object detection task, and typically require image segmentation. This often leads to weak generality and affects saliency extraction and algorithm efficiency. To address these problems, we propose an effective visual saliency detection method based on the multi-scale and multi-channel mean. The method consists of four parts: image filtering, image resizing, saliency value calculation and saliency map generation. The framework of the proposed method is shown in Fig. 3.
3.1 Image filtering
The physiology of human vision makes it more sensitive to the low frequency signal of an image than to the high frequency signal. The low frequency coefficients describe the main energy of the image, and the high frequency coefficients describe the details. Therefore, we highlight the low frequency components and attenuate the high frequency components by a 2-D wavelet transform in the HSV color space. For efficiency and simplicity, we use a three-level 2-D wavelet transform for image decomposition and reconstruction. The tower structure of the three-level 2-D wavelet transform is illustrated in Fig. 4.
The Haar wavelet was proposed by Alfred Haar in 1909. It is the simplest wavelet transform and, in addition, the only orthogonal wavelet with both symmetry and compact support [10, 20]. Therefore, we construct the filter banks from the Haar wavelet. The filter function of the Haar wavelet is as follows:
Where \( h_k \) is a real filter. Combined with Section 2.2, we obtain \( g_k = (-1)^k h_{1-k} \), \( \tilde{h}_k = h_k \) and \( \tilde{g}_k = g_k \). The high frequency signal image and low frequency signal image after 2-D Haar wavelet decomposition and reconstruction are shown in Fig. 5(c) and (d) respectively. The filtered image is obtained by
Where \( I_H \) and \( I_L \) are the high frequency signal image and the low frequency signal image respectively, and α and β are weight parameters, set to 0.9 and 1.1 respectively in our experiments. The filtered image I is the last picture in Fig. 5.
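The filtering step can be sketched as follows, under two stated assumptions. First, the combination is taken to be the weighted sum I = α·I_H + β·I_L, which is inferred from the description of α and β rather than quoted. Second, I_L is taken to be the reconstruction from the level-3 approximation coefficients only and I_H the reconstruction from all detail coefficients. For the orthonormal Haar wavelet, the former equals replacing every 8 × 8 block by its mean, so no explicit transform is needed. The function names are hypothetical.

```python
import numpy as np

def haar_low_high(channel, levels=3):
    """Split one channel into Haar low and high frequency images.

    low  : every (2**levels x 2**levels) block replaced by its mean
    high : the remainder, channel - low
    """
    channel = np.asarray(channel, dtype=float)
    b = 2 ** levels
    h, w = channel.shape
    if h % b or w % b:
        raise ValueError("height and width must be multiples of 2**levels")
    block_mean = channel.reshape(h // b, b, w // b, b).mean(axis=(1, 3))
    low = np.repeat(np.repeat(block_mean, b, axis=0), b, axis=1)
    high = channel - low
    return low, high

def filter_channel(channel, alpha=0.9, beta=1.1, levels=3):
    """Attenuate high frequencies and boost low frequencies (assumed weighted sum)."""
    low, high = haar_low_high(channel, levels)
    return alpha * high + beta * low
```

With α = β = 1 the function returns the input unchanged, which is a quick sanity check that the two images together carry all of the signal. A real image would need padding or cropping to a multiple of 8 pixels per side before this sketch applies.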
3.2 Resize image
Salient regions are associated with image scale. For the same object, attention on a small-scale image focuses on the whole object with shared features, which raises the saliency of small-scale regions, while attention on a large-scale image raises the saliency of large-scale regions. Therefore, we resize images with bicubic interpolation: we downscale the filtered image to obtain images at different scales, and upscale the saliency maps to obtain images at a unified scale.
Wavelet decomposition yields a series of sub-images at different resolutions, each one 1/4 the size of the image at the previous level. After a three-level 2-D wavelet transform, we therefore obtain sub-images of 1/4, 1/16 and 1/64 the size of the filtered image. After many experiments, we found that better results are obtained with four scales of the filtered image I. We consider four scales with scale factors \( \beta_1 = 1 \), \( \beta_2 = 1/4 \), \( \beta_3 = 1/16 \) and \( \beta_4 = 1/64 \), and the same operations are applied to the image at each of the four scales. The filtered image I is downscaled by bicubic interpolation to the four scales, and the results are illustrated in Fig. 6.
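One reading of these scale factors, assumed here because they mirror the wavelet sub-image sizes, is that they are area ratios, so each side of the image shrinks by the square root of the factor (1, 1/2, 1/4, 1/8). The helper below, with the hypothetical name `scaled_shapes`, computes the target sizes under that assumption; it only produces shapes and leaves the resampling itself to a bicubic resize routine.

```python
import math

def scaled_shapes(height, width, area_factors=(1, 1 / 4, 1 / 16, 1 / 64)):
    """Return (height, width) for each scale, treating factors as area ratios."""
    shapes = []
    for beta in area_factors:
        side = math.sqrt(beta)  # per-side scale for an area ratio beta (assumed)
        shapes.append((max(1, round(height * side)), max(1, round(width * side))))
    return shapes

# Example: a 512 x 384 filtered image gives 512x384, 256x192, 128x96 and 64x48.
print(scaled_shapes(512, 384))
```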
3.3 Calculate saliency values
To keep the saliency computation from producing higher saliency values near edges instead of uniformly highlighting salient objects, and from ignoring spatial relationships across image parts, we calculate the pixel mean of each channel in the HSV color space and take the distances between the images and the means of their channels as saliency values. The saliency map \( SV_n \) of an image is formulated as:
Where \( H_n \), \( S_n \) and \( V_n \) are the H, S and V channels of the filtered image in the HSV color space respectively, and \( H\_Mean_n \), \( S\_Mean_n \) and \( V\_Mean_n \) are the means of the respective channels. The multi-scale saliency maps are illustrated in Fig. 7.
Fig. 7 clearly shows that the multi-scale saliency maps contain too much background noise. To filter this background noise, highlight salient regions and avoid poor image definition, we retain only the values that are not less than the mean saliency of a given image. Following Section 2.3 and Section 3.2, we then upscale the saliency maps with bicubic interpolation to the size of the original image. The multi-scale saliency maps and their corresponding upscaled images are illustrated in Fig. 8.
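A minimal sketch of this computation follows. It assumes that the distance is the per-pixel Euclidean distance between the (H, S, V) vector and the vector of channel means; the exact distance is an assumption. The function name is hypothetical.

```python
import numpy as np

def channel_mean_saliency(hsv):
    """Per-pixel distance between an (h, w, 3) HSV image and its channel means."""
    hsv = np.asarray(hsv, dtype=float)
    means = hsv.mean(axis=(0, 1))   # (H_Mean, S_Mean, V_Mean)
    # Assumed Euclidean distance over the three channels.
    return np.sqrt(((hsv - means) ** 2).sum(axis=-1))
```

A uniform image yields zero saliency everywhere, and a pixel that differs from an otherwise uniform background receives the largest value, regardless of whether it lies on an edge.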
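The thresholding step can be sketched in a few lines. Setting the discarded values to zero is an assumption, since the text only says which values are kept; the function name is hypothetical.

```python
import numpy as np

def keep_above_mean(sv):
    """Keep saliency values >= the map's mean; set the rest to zero (assumed)."""
    sv = np.asarray(sv, dtype=float)
    return np.where(sv >= sv.mean(), sv, 0.0)

# Example: the mean of [[1, 2], [3, 6]] is 3, so only 3 and 6 survive.
print(keep_above_mean([[1, 2], [3, 6]]))
```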
3.4 Generate saliency map
We obtain the synthetic saliency map CS by simple linear addition of the 4 upscaled saliency maps. A linear model is then used to normalize CS into the saliency map \( SM_1 \), which is defined as,
Where CS(max) and CS(min) are the maximum and the minimum values of pixels in CS.
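The addition and normalization can be sketched as below, using min-max scaling built from CS(max) and CS(min). The output range [0, 255] is an assumption, chosen because the evaluation in Section 4.1 thresholds saliency maps with t ∈ [0, 255]; the function name and the handling of a constant map are also assumptions.

```python
import numpy as np

def combine_and_normalize(maps, out_max=255.0):
    """Add same-sized saliency maps and rescale the sum linearly to [0, out_max]."""
    cs = np.sum([np.asarray(m, dtype=float) for m in maps], axis=0)
    span = cs.max() - cs.min()
    if span == 0:
        return np.zeros_like(cs)   # constant map: nothing stands out (assumed)
    return out_max * (cs - cs.min()) / span
```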
Using the evaluation in Section 4.1, we obtain different AUC scores for different values of δ with the detection framework in Fig. 3, where δ (δ ∈ (0, 1]) is the scale of the input relative to the original image. As shown in Fig. 9, higher AUC scores are obtained when δ = 0.6 and δ = 1.
Considering the close correlation between a saliency map and its multi-scale maps, we take both the image at 0.6 times the original size and the original image, and generate saliency maps with the detection framework in Fig. 3, denoted by \( SM_{0.6} \) and \( SM_1 \) and shown in Fig. 10(a) and (b) respectively. The final integrated saliency map SM (Fig. 10(c)) is obtained by
Where γ is a weight parameter, γ ∈ [0, 1]. AUC scores for γ values are illustrated in Fig. 11. We set γ to be 0.45 in our experiments.
4 Experiments
We present an empirical evaluation and analysis of the proposed method against 9 state-of-the-art methods on the MSRA salient object database with labeled ground truths [1], which contains 1000 color images with accurate pixel-wise object-contour segmentations. All programs were run on Windows 7 with an AMD Athlon(tm) X2 Dual-Core QL-64 2.1 GHz processor and MATLAB R2011b.
4.1 Evaluation standards
To compare the quality of different saliency maps, we use a widely adopted tool, the receiver operating characteristic curve (ROC Curve) [4, 18, 25]. In addition, average values of Precision, Recall and F-Measure [1, 2, 23] are computed to measure the performance of the different saliency methods.
Given a saliency map and its labeled ground truth, the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) can be calculated as follows:
Where N is the total number of pixels in SM, \( SM_i \) is the i-th pixel in SM, t is the threshold for binarization, t ∈ [0, 255], \( BM_i \) is the i-th pixel of the binary mask, and the function \( f(n_1, n_2) \) is defined as,
Correspondingly, the true positive rate (TPR) and the false positive rate (FPR) are calculated as,
By varying the threshold t from 0 to 255, the ROC Curve of the saliency model is plotted as the mean TPR versus the mean FPR. The ROC Curve is a composite indicator that reflects the sensitivity (TPR) and 1-specificity (FPR) of a continuous variable, and it is the most prevalent criterion for evaluating visual saliency detection methods. The area under the ROC Curve (AUC) can be used as a quantitative summary statistic of the experimental results. The higher the AUC, the higher the accuracy of the method.
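The threshold sweep can be sketched as follows for a single image, using the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN). Marking a pixel as salient when its value is at least t, integrating with the trapezoid rule, and the function names are assumptions. The paper averages the rates over all images before plotting, a step omitted here.

```python
import numpy as np

def roc_points(sm, gt):
    """Return FPR and TPR arrays for thresholds t = 0..255 on one image.

    sm : saliency map with values in [0, 255]
    gt : binary ground-truth mask containing both classes
    """
    sm = np.asarray(sm, dtype=float).ravel()
    gt = np.asarray(gt, dtype=bool).ravel()
    n_pos = np.count_nonzero(gt)
    n_neg = gt.size - n_pos
    fpr = np.empty(256)
    tpr = np.empty(256)
    for t in range(256):
        bm = sm >= t                               # assumed binarization rule
        tpr[t] = np.count_nonzero(bm & gt) / n_pos    # TP / (TP + FN)
        fpr[t] = np.count_nonzero(bm & ~gt) / n_neg   # FP / (FP + TN)
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoid-rule AUC for points ordered from t = 0 to t = 255."""
    # Rates fall as t grows, so reverse them and prepend the (0, 0) corner.
    x = np.concatenate(([0.0], fpr[::-1]))
    y = np.concatenate(([0.0], tpr[::-1]))
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))
```

A saliency map that equals the ground truth gives an AUC of 1, while a constant map carries no information and gives 0.5.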
Analogously, the Precision and the Recall are defined as,
We vary the threshold from 0 to 255 on a given saliency map with saliency values in the range [0, 255], calculate Precision and Recall at each threshold, and then compute the average values of Precision and Recall. The average F-Measure is obtained over the same labeled ground truths.
Where P is the average value of Precision and R is the average value of Recall. We use \( \beta^2 = 0.3 \) to weigh Precision more than Recall, as suggested in [1, 2, 23]. The higher the Precision, Recall and F-Measure, the better the performance of the method.
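For illustration, the sketch below computes the averaged Precision and Recall for one image and the F-Measure in the form used in [1], F = (1 + β²)·P·R / (β²·P + R), with β² = 0.3. The binarization rule (value at least t), the convention of counting Precision as 0 when no pixel is detected, and the function names are assumptions.

```python
import numpy as np

def average_precision_recall(sm, gt):
    """Average Precision and Recall over thresholds t = 0..255 for one image."""
    sm = np.asarray(sm, dtype=float).ravel()
    gt = np.asarray(gt, dtype=bool).ravel()
    n_pos = np.count_nonzero(gt)
    precisions, recalls = [], []
    for t in range(256):
        bm = sm >= t                                   # assumed binarization rule
        tp = np.count_nonzero(bm & gt)
        detected = np.count_nonzero(bm)                # TP + FP
        precisions.append(tp / detected if detected else 0.0)  # assumed convention
        recalls.append(tp / n_pos)                     # TP / (TP + FN)
    return float(np.mean(precisions)), float(np.mean(recalls))

def f_measure(p, r, beta2=0.3):
    """F-Measure with beta squared = beta2; returns 0 when both inputs are 0."""
    denom = beta2 * p + r
    return (1.0 + beta2) * p * r / denom if denom else 0.0
```

As a worked example, P = 0.8 and R = 0.5 give F = 1.3 × 0.4 / (0.24 + 0.5) ≈ 0.703, which sits closer to P than the balanced F1 score of about 0.615 does.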
4.2 Experimental results and analyses
We compare the proposed method with several state-of-the-art methods: IK [9], MZ [15], ZS [24], HZ [8], ZT [25], BT [4], AS [2], MV [18] and XL [23]. Some sample results, in which brighter pixels indicate higher saliency probabilities, are illustrated in Fig. 12. The IK [9] maps are generated at low resolution; the method tends to highlight boundaries, assigns relatively low probabilities to pixels inside objects, and extracts only small parts of salient objects. The MZ [15], HZ [8] and ZT [25] methods focus on local abrupt changes, so they capture only the edges of objects. The ZS [24], AS [2] and MV [18] methods attend to whole regions of salient objects; however, they either miss large parts of salient objects or produce unreasonable or diffuse maps. The BT [4] and XL [23] methods are able to locate whole salient objects, but their results include a lot of background detail. Our method considers both global and local saliency, preserves the edges of salient objects, filters out much of the background detail and yields higher image definition. Overall, the saliency maps of our method are much closer to the labeled ground truths.
The ROC Curves of the various saliency methods and the corresponding AUC bars are shown in Figs. 13 and 14 respectively. The maximum sensitivities, the maximum 1-specificities and the corresponding AUC scores are given in Table 1. As shown in Figs. 13, 14 and Table 1, our method achieves the clearest relationship between sensitivity and 1-specificity and the highest AUC score, 0.8630, with both the maximum 1-specificity and the maximum sensitivity reaching 1. The proposed method performs better than the other 9 state-of-the-art methods, which indicates that our saliency maps locate salient regions more precisely and effectively. Although the sensitivities of the AS [2] and XL [23] methods at low 1-specificities are higher than that of the proposed method, their maximum 1-specificities are 0.8472 and 0.7844 respectively, clearly lower than that of the proposed method. The maximum sensitivity and 1-specificity of the ZT [25] and MV [18] methods approach those of the proposed method; however, the curvature of their ROC Curves is clearly lower. For these reasons, their AUC scores are lower than the AUC score of the proposed method.
The Precision, Recall and F-Measure of each method's saliency maps are averaged over the 1000 images, and the results are shown in Fig. 15 and Table 1. AS [2] shows a high Precision but very poor Recall, indicating that it is better suited for gaze tracking, but perhaps not well suited for salient region segmentation. XL [23] shows a high Recall but low Precision. Among all the methods, the proposed method achieves one of the best Precision values together with a higher Recall and the best F-Measure. Overall, the proposed method not only enhances the definition of salient regions but also improves the accuracy of visual saliency detection.
5 Summary
We propose an effective visual saliency detection method based on the multi-scale and multi-channel mean. The method is tied neither to human eye fixation prediction nor to a specific salient object detection task, and it does not require image segmentation. We analyze frequency signals and color channels and detect saliency at multiple scales. Using the MSRA image database and several evaluation standards, we demonstrate that the proposed method outperforms 9 state-of-the-art saliency methods. However, the accuracy of saliency detection on complex textured backgrounds is not very high. Future work may benefit from incorporating high-level factors such as symmetry and semantics into saliency maps, and from seeking more effective physiological, psychological and computer vision models for saliency detection.
References
1. Achanta R, Hemami S, Estrada F, Süsstrunk S (2009) Frequency-tuned salient region detection. IEEE Conference on Computer Vision and Pattern Recognition, Miami, pp 1597–1604
2. Achanta R, Süsstrunk S (2010) Saliency detection using maximum symmetric surround. IEEE International Conference on Image Processing, Hong Kong, pp 2653–2656
3. Bhatnagar G, Wu QMJ (2011) An image fusion framework based on human visual system in framelet domain. Int J Wavelets Multiresolution Inf Process 10(1):1–30
4. Bruce ND, Tsotsos JK (2009) Saliency, attention, and visual search: an information theoretic approach. J Vis 9(3):1–24
5. Chang KY, Liu TL, Lai SH (2011) From co-saliency to co-segmentation: an efficient and fully unsupervised energy minimization model. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Providence, pp 2129–2136
6. Ding ZH, Yu Y, Wang B, Zhang LM (2012) An approach for visual attention based on biquaternion and its application for ship detection in multispectral imagery. Neurocomputing 76(1):9–17
7. Furuya T, Ohbuchi R (2014) Visual saliency weighting and cross-domain manifold ranking for sketch-based image retrieval. Proc. Multi-Media Modeling, Dublin, pp 37–49
8. Hou XD, Zhang LQ (2007) Saliency detection: a spectral residual approach. IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, pp 1–8
9. Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans Pattern Anal Mach Intell 20(11):1254–1259
10. Kekre HB, Thepade SD, Chaturvedi RN (2013) Block based information hiding using Cosine, Hartley, Walsh and Haar Wavelets. Int J Adv Comput Res 3(1):1–6
11. Keys RG (1981) Cubic convolution interpolation for digital image processing. IEEE Trans Acoust, Speech, Signal Process 29(6):1153–1160
12. Koch C, Ullman S (1985) Shifts in selective visual attention: towards the underlying neural circuitry. Hum Neurobiol 4:219–227
13. Lin YW, Tang YY, Fang B et al (2013) A visual-attention model using earth mover’s distance-based saliency measurement and nonlinear feature combination. IEEE Trans Pattern Anal Mach Intell 35(2):314–328
14. Liu T, Yuan ZJ, Sun J, Wang JD, Zhang NN (2011) Learning to detect a salient object. IEEE Trans Pattern Anal Mach Intell 33(2):353–367
15. Ma YF, Zhang HJ (2003) Contrast-based image attention analysis by using fuzzy growing. ACM International Conference on Multimedia, New York, pp 374–381
16. Mallat SG (1989) Multiresolution approximations and wavelet orthonormal bases of L2(R). Trans Am Math Soc 315(1):69–87
17. Mallat SG (1989) A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11(7):674–693
18. Murray N, Vanrell M, Otazu X, Parraga CA (2011) Saliency estimation using a non-parametric low-level vision model. IEEE Conference on Computer Vision and Pattern Recognition, Providence, pp 433–440
19. Qin CC, Zhang GP, Zhou YC, Tao WB, Cao ZG (2014) Integration of the saliency-based seed extraction and random walks for image segmentation. Neurocomputing 129(7):378–391
20. Ray SS (2012) On Haar wavelet operational matrix of general order and its application for the numerical solution of fractional Bagley Torvik equation. Appl Math Comput 218(9):5239–5248
21. Wang R, Yu ZX, Du LF, Lee TY (2013) Saliency-based adaptive block compressive sampling for image signals. J Image Graph 18(10):1255–1260
22. Wang Q, Yuan Y, Yan PK, Li XL (2013) Saliency detection by multiple-instance learning. IEEE Trans Cybern 43(2):660–672
23. Xie YL, Lu HC, Yang MH (2013) Bayesian saliency via low and mid level cues. IEEE Trans Image Process 22(5):1689–1698
24. Zhai Y, Shah M (2006) Visual attention detection in video sequences using spatiotemporal cues. ACM International Conference on Multimedia, New York, pp 815–824
25. Zhang LY, Tong MH, Marks TK, Shan HH, Cottrell GW (2008) SUN: a Bayesian framework for saliency using natural statistics. J Vis 8(7):1–20
26. Zhao GP, Yin MF, Chen Y (2013) Image salient region detection based on histogram. Xi’an CN: Proceedings of the 32nd Chinese Control Conference, pp 3570–3574
27. Zhu XQ, Huang JC, Shao ZF, Cheng GQ (2013) A new approach for interesting local saliency features definition and its application to remote sensing imagery retrieval. Geomatics Inform Sci Wuhan Univ 38(6):652–655
Acknowledgments
This work has been supported by the Chunhui Program foundation of the Ministry of Education of China (Grant No. z2011149).
Cite this article
Sun, L., Tang, Y. & Zhang, H. Visual saliency detection based on multi-scale and multi-channel mean. Multimed Tools Appl 75, 667–684 (2016). https://doi.org/10.1007/s11042-014-2314-6
DOI: https://doi.org/10.1007/s11042-014-2314-6