1 Introduction

Vision is one of the most intensively studied aspects of the human brain, and pre-attentive visual selection in particular has fascinated researchers in cognitive neuroscience, neuropsychology, and computer science [22]. Theoretical studies of vision indicate that neural activities in the retina and primary visual cortex (V1) represent the saliency of visual inputs in a bottom-up manner, so that visual information can be efficiently encoded and selected for further detailed or attentive processing [22, 41]. Visual saliency, a bottom-up process, is closely related to visual uniqueness, difference, clarity, and surprise. Inspired by biological vision, many works exploit image properties such as color, illumination, gradients, edges, and the spatial relationship between foreground and background to estimate saliency [2, 8, 10, 19, 33].

Different from general segmentation algorithms, which partition an image into multiple regions with coherent properties, saliency detection aims to identify salient object regions in an image. Since the saliency map represents the important information in the source image, saliency detection plays a vital role in image understanding, analysis, and processing. It has been applied to a variety of applications including image segmentation [37], object recognition and understanding [31], content-aware image/video retargeting [23, 38], content-based image retrieval [14], and image/video compression [18].

In order to obtain high-quality and accurate saliency detection results, many bottom-up approaches have been proposed in recent years to explore the distinction between objects and backgrounds. We review representative related works that employ various visual cues for salient object detection. For a more comprehensive survey of the state of the art in visual attention modeling, we refer readers to [35] and [8].

A widely used saliency cue is the global statistical features of an image, including color contrast [10], luminance, edges and gradients [19], and spectral analysis [2]. Itti et al. [19] propose to use color contrast for salient region detection. Seo and Milanfar [33] propose a saliency measure called self-resemblance, which computes a pixel's saliency by comparing the pixel to its surroundings. Achanta et al. [2] propose to compute the saliency of each pixel as the difference between the pixel's color and the average image color. Hou and Zhang [17] propose a fast saliency detection approach based on the spectral residual, which is able to detect foreground objects in an image without any prior knowledge. Rahtu et al. [30] employ a conditional random field (CRF) model to segment an initial saliency map produced from local feature contrast in illumination, color, and motion. Since their approach relies on a sliding window, the authors exploit the integral histogram method and graph cut solvers to improve computational efficiency.

In addition, image background information has been exploited in several approaches [21, 42] and has proven to be a useful saliency cue. Zhu et al. [42] proposed a background measure called boundary connectivity, which relates the connected boundary length of a superpixel to the superpixel size. Because image segmentation itself is an unsolved problem, it is hard to estimate the size of a superpixel and its connected boundary length, so Zhu et al. [42] proposed a “soft” approach to compute them. They constructed an undirected graph with edges weighted by the color contrast between neighboring superpixels, and the contribution of one superpixel to another is computed from the accumulated edge weights along their shortest path on the graph. This “soft” computation alleviates the estimation problem to some extent. However, it fails when the colors of the salient object are similar to those of the boundary, and it is computationally inefficient.

The approaches mentioned above usually operate on the raw image or video. Recently, several approaches exploit compressed-domain information such as transform coefficients and motion vectors to detect the saliency of an image [12] or a video [13] directly in the compressed domain. For example, Fang et al. [12] propose a saliency detection approach for JPEG images: four feature maps covering intensity, color, and texture are computed from the DCT coefficients, and the final saliency map is obtained by integrating these feature maps. Their approach yields impressive results.

Detecting saliency in crowded scenes is a relatively new problem. Jiang et al. [20] propose an interesting approach for this setting. Based on the observation that face features play an important role in determining saliency, especially in crowds, they extract low-level center-surround contrast and high-level semantic face features for saliency prediction, and use multiple kernel learning (MKL) to learn a classifier that automatically combines these features from their eye-tracking dataset [20]. Based on the Random Forest algorithm, Ma et al. [26] propose a crowd saliency prediction approach that optimizes the feature combination. In addition to traditional low-level features (color, intensity, orientation) and face features (face size, face density, frontal face, profile face), they define two new features, FaceSizeDiff and FacePoseDiff, to improve the quality of saliency detection [26].

Since visual attention is affected by various factors, multiple visual cues need to be considered simultaneously in order to obtain accurate and robust saliency detection results. The idea of integrating multiple cues to produce the final result appears in many works. For example, in the context of image quality prediction, Yang et al. [39] propose two sub-models to separately process user-generated images, which are multi-dimensional data including text, images, and social relations; the outputs of the two models are fused to produce the final quality score, which is fairly consistent with the ground truth.

In this paper, we propose a novel bottom-up approach to automatically detect salient object regions in an image. Our approach operates at the superpixel level to reduce computation. We first segment the input image into a set of superpixels using a superpixel segmentation algorithm [24]. Since superpixels are typically the result of over-segmentation, regions of the input image with coherent image attributes may be partitioned into multiple independent superpixels. To make the representation of regions more compact while further reducing the number of superpixels, we fuse neighboring superpixels with consistent image features such as color and texture. After the input image has been separated into several distinct regions, our goal is to find the salient regions among them. Based on widely accepted biological visual saliency cues, we propose four saliency weights, i.e., local contrast weight, superpixel clarity weight, background probability weight, and central bias weight, to effectively measure the saliency of each region. These four weights are integrated to produce our final saliency map. Furthermore, in order to obtain a clean saliency map whose salient areas are more consistent with the object regions, we propose a superpixel-level saliency smoothing algorithm to optimize the integrated saliency map. The overview of our approach is presented in Fig. 1. The key contributions of our paper are summarized as follows:

  • We propose a superpixel fusion algorithm, which reduces the number of superpixels and makes the saliency map more consistent with the object. The main idea is to fuse neighboring superpixels with consistent features such as color and texture.

  • We propose four powerful saliency weights that consider the clarity of superpixels, spatial information, and color contrast between superpixels. These saliency weights have low computational complexity and effectively represent visual saliency cues. In addition, the four resulting saliency weights are integrated in a principled way via multiplication- and summation-based fusion.

  • In order to optimize the integrated saliency map obtained above, we propose a superpixel-level saliency smoothing algorithm to make the saliency areas more consistent with the object regions.

Fig. 1 Procedure of the proposed approach

In the following sections, we detail these four saliency weights and the saliency smoothing algorithm, and show how each weight contributes to determining the saliency of each region. The remainder of this paper is organized as follows. Section 2 introduces our saliency detection approach in detail. Section 3 presents the experimental results. An application of our approach is introduced in Section 4. We draw our conclusions in Section 5.

2 The proposed approach

As Fig. 1 shows, the input image is first segmented into a set of superpixels using a fast and robust superpixel segmentation algorithm [24]. Second, we fuse neighboring superpixels with consistent color and texture features in order to reduce the number of superpixels. Then, based on the theory of human visual attention and the observed difference in the spatial layout of image background and foreground, we propose four saliency weights, i.e., local contrast weight, superpixel clarity weight, background probability weight, and central bias weight, to effectively measure the saliency of each fused superpixel. Finally, the saliency map is obtained by integrating all saliency weights, and a saliency smoothing step is performed to optimize the integrated map and obtain a clean result. Figure 2 illustrates the pipeline of our saliency detection approach, and the following subsections describe each stage in detail.

Fig. 2 The pipeline of our approach. (a) Input Image, (b) Superpixel Segmentation, (c) Superpixel Fusion, (d) Local Contrast Weight, (e) Superpixel Clarity Weight, (f) Background Probability Weight, (g) Central Bias Weight, (h) Integration Weight, (i) Superpixel-Level Saliency Smoothing, and (j) Ground Truth. After combining the weights of local contrast, superpixel clarity, background probability, and central bias, we obtain high-quality saliency maps (h and i) comparable to the human-labeled ground truth

2.1 Superpixel segmentation and fusion

Our approach operates at the superpixel level, so we first segment the input image into a set of superpixels using Liu's algorithm [24] (we use a MATLAB implementation from http://mingyuliu.net/), which is fast and robust for images of different natural scenes. Figure 2b shows the over-segmentation results.

Then, we fuse neighboring superpixels with coherent features. The motivation for superpixel fusion is to reduce the impact of the superpixel segmentation results on saliency detection. Moreover, fusing superpixels with coherent features not only improves computational efficiency, but also makes the final saliency values more consistent. We use a four-dimensional feature vector to represent each superpixel. Following [15], the feature vector consists of the CIE-Lab color and a Gabor filter response. The Gabor filter uses 8 orientations, with both the bandwidth and the extracted scale set to one; the amplitude responses over the 8 orientations are combined to form the texture feature. When the feature contrast of two neighboring superpixels is less than the threshold T, the two superpixels are fused into a new, larger superpixel. Therefore, the number of superpixel clusters is determined by the content of the detected image. In our experiments, we observe that the number of superpixel clusters after fusion is about 4 ∼ 18. The threshold T is defined as:

$$ T=mean({SP}_{contr})-std({SP}_{contr})/2 $$
(1)

where SP contr denotes the contrast of neighboring superpixels in the CIE-Lab color and texture feature space for the detected image. mean(SP contr ) and std(SP contr ) denote the mean and the standard deviation of SP contr , respectively.
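
For concreteness, the following Python sketch (an illustration under our own naming, not the authors' MATLAB implementation) fuses adjacent superpixels whose feature contrast falls below the threshold T of (1), using a simple union-find over the adjacency graph; the per-superpixel feature vectors and the list of adjacent pairs are assumed to be precomputed.

```python
import numpy as np

def fuse_superpixels(features, neighbor_pairs):
    """Merge neighboring superpixels whose feature contrast is below T (Eq. 1).

    features       : (N, 4) array of per-superpixel CIE-Lab + Gabor descriptors
    neighbor_pairs : list of (i, j) index pairs of adjacent superpixels
    Returns a label vector mapping each original superpixel to a fused region.
    """
    pairs = np.asarray(neighbor_pairs)
    # Feature contrast of every adjacent pair in the joint color/texture space.
    sp_contr = np.linalg.norm(features[pairs[:, 0]] - features[pairs[:, 1]], axis=1)
    T = sp_contr.mean() - sp_contr.std() / 2.0                    # Eq. (1)

    parent = list(range(len(features)))                           # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), c in zip(pairs, sp_contr):
        if c < T:                                                 # coherent neighbors are fused
            parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(len(features))])
    _, labels = np.unique(roots, return_inverse=True)             # compact fused-region labels
    return labels
```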

Figure 2c shows the results of superpixel fusion. We observe that the background consists of only a few superpixels, and the foreground is essentially represented by a single new, larger superpixel. In our approach, we use the vector f = {f i }, i = 1,2,…, M, to denote the fused superpixels, where M is the total number of fused superpixels and f i denotes the i-th fused superpixel.

2.2 Saliency weight calculation

Local contrast weight

The human visual system pays close attention to local parts of an image. Theories from the physiology of vision and neuroscience have shown that the central 10° of the visual field is represented by at least 60% of the visual cortex and has the greatest visual acuity and color sensitivity [6, 11, 32]. Figure 3 illustrates an example: only a small portion of the image is processed carefully by the human visual system, while the rest is largely ignored. This conclusion is also supported by human visual attention theory [3, 19, 21, 34, 36].

Fig. 3 (a) Visual acuity as a function of position on the retina. Note that visual acuity is maximal at 0° eccentricity (the central visual field), whereas it is minimal in more peripheral areas [6, 11, 32]. (b) Original image. (c) Pixel spatial distribution. (d) An example of retinal imaging: from the image center to the peripheral areas, the resolution changes from high to low

Based on the theories of the visual field and human visual attention [19, 21, 36], we propose a local contrast approach to calculate the saliency value of each superpixel. A superpixel with higher contrast to its surroundings attracts more visual attention and is therefore selected as a perceptually salient region.

It should be noted that our local contrast approach is different from the widely used global contrast method. Figure 4 shows an illustrative example of global contrast vs. our local contrast. In Fig. 4, the most salient object is clearly the red block rather than the black block. The reason is that the red block is surrounded locally only by the white background, while the black block is surrounded by the white background as well as a green block and a blue block. The saliency contribution of the white area at the top right corner of the black block is greatly reduced because the green block and the blue block lie on the top and right sides of the black block, respectively.

Fig. 4 Global contrast vs. our local contrast. The red block in the input image (left) is more salient than the others. With the global contrast method, the black block becomes the most salient object (middle), whereas our local contrast detects the red block (right), which is more consistent with human visual attention

Specifically, we define the saliency of a superpixel f i using its feature contrast to its surrounding superpixels in the image. To calculate the local contrast weight, we construct an undirected weighted graph by connecting all neighboring superpixels (f i , f j ) and assigning their edge weight Dist(f i , f j ) as the Euclidean spatial distance between superpixels f i and f j . The local contrast weight W LC (f i ) of a superpixel f i is defined as:

$$ W_{LC}(f_{i}) = {\sum}_{j=1}^{M}\frac{C_{f_{i},f_{j}} \cdot \sqrt{Size(f_{j})}}{Dist(f_{i},f_{j})} $$
(2)

where Size(f j ) is the number of pixels in superpixel f j , and \(C_{f_{i},f_{j}}\) is the local feature contrast between superpixels f i and f j . Note that \(C_{f_{i},f_{j}}\) differs from the feature distance between f i and f j in the CIE-Lab color and texture feature space; it is computed as:

$$ C_{f_{i},f_{j}} = \left\{\begin{array}{llllllll} Contr(f_{i},f_{j}), & if~(f_{i},f_{j})~adjacent\\ Contr(f_{i},f_{j}) - \max \limits_{k\in Path(i,j)} Contr(f_{k},f_{k+1}), & if~(f_{i},f_{j})~not~adjacent \end{array}\right. $$
(3)

where Contr(f i , f j ) is the feature contrast between superpixels f i and f j in the CIE-Lab color and texture feature space. Equation (3) shows that when superpixels f i and f j are not adjacent, the local contrast \(C_{f_{i},f_{j}}\) equals Contr(f i , f j ) minus the maximum feature contrast along their shortest path on the graph. In other words, the final contrast between f i and f j considers not only their feature contrast, but also the feature contrasts along their shortest path on the graph.

Equations (2) and (3) favor superpixels with large feature contrast to their surrounding regions. Note that this is quite different from the global contrast method, which defines the saliency of each region as the weighted sum of the region's contrasts to all other regions in the image [10]. We calculate the shortest paths between all superpixel pairs using the algorithm of [7]. As our graph is very sparse, computing (3) is efficient and requires little storage. Figure 2d shows the results of the normalized local contrast weight.
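
A possible implementation of the local contrast weight is sketched below in Python. This is an illustrative sketch under our own naming: the pairwise feature contrast Contr(·,·), the superpixel centroids, and the adjacency pairs are assumed to be precomputed, and the shortest paths needed by (3) are obtained with SciPy's graph routines. Since the number of fused superpixels M is small (about 4–18 in our experiments), the nested loop is inexpensive.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def local_contrast_weight(centers, sizes, contr, neighbor_pairs):
    """Sketch of the local contrast weight W_LC of Eqs. (2)-(3).

    centers        : (M, 2) superpixel centroids in pixel coordinates
    sizes          : (M,) superpixel pixel counts, Size(f_j)
    contr          : (M, M) pairwise feature contrast Contr(f_i, f_j)
    neighbor_pairs : list of (i, j) pairs of spatially adjacent superpixels
    """
    M = len(sizes)
    pairs = np.asarray(neighbor_pairs)
    rows, cols = pairs[:, 0], pairs[:, 1]

    # Euclidean spatial distance between all superpixel centroids, Dist(f_i, f_j).
    spatial = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)

    # Undirected graph over adjacent superpixels, edges weighted by spatial distance.
    graph = csr_matrix((spatial[rows, cols], (rows, cols)), shape=(M, M))
    _, pred = shortest_path(graph, directed=False, return_predecessors=True)

    adjacent = np.zeros((M, M), dtype=bool)
    adjacent[rows, cols] = adjacent[cols, rows] = True

    def max_contr_on_path(i, j):
        # Largest feature contrast between consecutive superpixels on the shortest path i -> j.
        best, k = 0.0, j
        while pred[i, k] >= 0:
            best = max(best, contr[pred[i, k], k])
            k = pred[i, k]
        return best

    w = np.zeros(M)
    for i in range(M):
        for j in range(M):
            if i == j:
                continue
            c = contr[i, j] if adjacent[i, j] else contr[i, j] - max_contr_on_path(i, j)
            w[i] += c * np.sqrt(sizes[j]) / spatial[i, j]         # Eq. (2)
    return w
```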

Superpixel clarity weight

Compared with blurred regions, we are usually more interested in objects that appear clear (in focus) in an image, so the clarity cue should be considered when identifying perceptually salient regions. The question is how to measure the clarity of a region in an image.

Based on the observation that image clarity is correlated with image attributes such as edge richness, contrast, and illumination, we propose an approach to measure the clarity of each superpixel. The superpixel clarity weight W SC (f i ) of a superpixel f i is defined as:

$$ W_{SC}(f_{i}) = \frac{Edge(f_{i})}{Gray(f_{i})} $$
(4)

where Edge(f i ) is the average color-edge value of superpixel f i [16], and Gray(f i ) is the average gray value of superpixel f i . Equation (4) means that if a region has rich edges and relatively low illumination, its clarity is relatively high. Figure 2e shows the results of the normalized superpixel clarity weight; it demonstrates that the importance of each superpixel can be well discriminated according to (4). Figure 5 shows more results of the superpixel clarity weight.
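
The clarity weight of (4) reduces to two per-superpixel averages. A minimal Python sketch (with hypothetical input names; the color-edge map of [16] is assumed to be computed beforehand) is:

```python
import numpy as np

def clarity_weight(edge_map, gray_img, labels):
    """Sketch of the superpixel clarity weight W_SC of Eq. (4).

    edge_map : (H, W) per-pixel color-edge magnitude (e.g. from a color edge detector)
    gray_img : (H, W) per-pixel gray values
    labels   : (H, W) fused-superpixel label of each pixel, in {0, ..., M-1}
    """
    lab = labels.ravel()
    M = lab.max() + 1
    counts = np.maximum(np.bincount(lab, minlength=M), 1)         # avoid division by zero
    edge_mean = np.bincount(lab, weights=edge_map.ravel(), minlength=M) / counts
    gray_mean = np.bincount(lab, weights=gray_img.ravel(), minlength=M) / counts
    return edge_mean / (gray_mean + 1e-8)   # rich edges and low illumination -> high clarity
```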

Fig. 5 More examples of the superpixel clarity weight. It clearly demonstrates that the importance of each superpixel can be well discriminated according to the superpixel clarity weight

Background probability weight

Intuitively, background regions are much more connected to image boundaries than foreground ones, i.e., the less a region touches the image boundary, the more salient it is. Zhu et al. [42] proposed a “soft” approach to compute this boundary connectivity, which is inefficient and fails when the colors of the salient object are similar to those of the boundary. Different from [42], we directly compute each superpixel's size and its connected boundary length based on the results of our superpixel fusion. The background probability weight W BG (f i ) of a superpixel f i is defined as:

$$ W_{BG}(f_{i}) = exp\left( -\frac{\left( \frac{Conbd(f_{i})}{\sqrt{Size(f_{i})}} - {\omega}_{bg}\right)^{2}}{2 {\sigma}_{bg}^{2}}\right) $$
(5)

where Conbd(f i ) is the number of pixels of superpixel f i that lie on the image boundary. Dividing by the square root of the superpixel size achieves invariance to image size. In our implementation, ω bg and σ bg are set to 0 and 0.2, respectively.

Based on the superpixel fusion results, the calculation of (5) is very fast and effective because we only need to count the number of pixels on the image boundary and in each superpixel. This is feasible because we experimentally find that, after the superpixel fusion operation, the foreground and background of the input image are usually represented by only one or a few superpixels. Figure 2f shows the results of the normalized background probability weight; it shows that the background in the input image is well suppressed according to (5).
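
Because only pixel counts are involved, (5) can be evaluated with a few histogram operations. The sketch below (our illustration, assuming a per-pixel label map of the fused superpixels) counts the boundary pixels of each superpixel and applies the Gaussian falloff with ω bg = 0 and σ bg = 0.2:

```python
import numpy as np

def background_probability_weight(labels, omega_bg=0.0, sigma_bg=0.2):
    """Sketch of the background probability weight W_BG of Eq. (5).

    labels : (H, W) fused-superpixel label of each pixel, in {0, ..., M-1}
    """
    M = labels.max() + 1
    size = np.maximum(np.bincount(labels.ravel(), minlength=M), 1)       # Size(f_i)

    # Pixels of each superpixel lying on the image boundary, Conbd(f_i).
    border = np.concatenate([labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]])
    conbd = np.bincount(border, minlength=M)

    ratio = conbd / np.sqrt(size)                  # size-invariant boundary connectivity
    return np.exp(-((ratio - omega_bg) ** 2) / (2.0 * sigma_bg ** 2))
```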

Central bias weight

In the human visual system, image center regions draw more attention than other regions [15], i.e., the saliency values of central regions are higher than those of image boundary regions. Many works use the central bias as a saliency cue to suppress background near the image boundary [4, 15, 42]. In this fashion, the central bias weight W CB (f i ) of a superpixel f i can be written as:

$$ W_{CB}(f_{i}) = \frac{{\sum}_{p_{k} \in f_{i}}^{} e^{\frac{-{Dist(p_{k},0)}^{2}}{2 {\sigma}_{cb}^{2}}}}{Size(f_{i})} $$
(6)

where Dist(p k ,0) is the spatial distance between pixel p k and the image center. In our implementation, we set σ cb = 0.5. Equation (6) shows that the central bias weight of each superpixel is obtained by averaging the central bias weights of the pixels it contains. Figure 2g shows the results of the normalized central bias weight; it demonstrates that the central bias saliency cue suppresses the boundary background in the input image.
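
A minimal sketch of (6) is given below. Note that the normalization of pixel-to-center distances is an assumption on our part (the paper does not state it explicitly); here coordinates are scaled by the image size so that σ cb = 0.5 is expressed relative to the image dimensions:

```python
import numpy as np

def central_bias_weight(labels, sigma_cb=0.5):
    """Sketch of the central bias weight W_CB of Eq. (6).

    labels : (H, W) fused-superpixel label of each pixel, in {0, ..., M-1}.
    Pixel-to-center distances are normalized by the image size (an assumption).
    """
    H, W = labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    dist = np.sqrt(((ys - H / 2.0) / H) ** 2 + ((xs - W / 2.0) / W) ** 2)
    pixel_w = np.exp(-dist ** 2 / (2.0 * sigma_cb ** 2))           # per-pixel central bias

    lab = labels.ravel()
    M = lab.max() + 1
    counts = np.maximum(np.bincount(lab, minlength=M), 1)
    # Average the per-pixel weights inside each fused superpixel.
    return np.bincount(lab, weights=pixel_w.ravel(), minlength=M) / counts
```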

2.3 Weight integration

So far, we have introduced four bottom-up saliency weights. Used independently, each weight has its merits and, of course, its shortcomings. The common integration approaches are linear summation and pixel-wise multiplication of all the saliency weights [15]. Figure 6 shows the difference between the summation and multiplication combinations. Generally, multiplication emphasizes the saliency regions common to all weights and yields higher precision, while summation favors higher recall.

Fig. 6 Weight combination: summation and multiplication. Background noise is effectively suppressed by multiplying the weights, which improves saliency detection accuracy, while summation favors higher recall

In this paper, we combine the advantages of multiplication and summation and use the following principle to fuse the four saliency weights mentioned above. For a superpixel f i , the integration of these weights is defined as:

$$ S(f_{i}) = \omega \cdot S_{multi}(f_{i}) + \varphi \cdot S_{sum}(f_{i}) $$
(7)
$$ S_{multi}(f_{i}) = W_{LC}(f_{i}) \cdot W_{SC}(f_{i}) \cdot W_{BG}(f_{i}) \cdot W_{CB}(f_{i}) $$
(8)
$$ S_{sum}(f_{i}) = \alpha \cdot W_{LC}(f_{i}) + \beta \cdot W_{SC}(f_{i}) + \gamma \cdot W_{BG}(f_{i}) + \lambda \cdot W_{CB}(f_{i}) $$
(9)

where W LC , W SC , W BG , and W CB are our four saliency weights, and S(f i ) is the integrated weight of the superpixel f i . In our implementation, ω and φ are set to 0.5, so that the multiplication and summation results contribute equally to the integration. The parameters α, β, γ, and λ in (9) are empirically set to 0.5, 0.5, 1, and 0.5, respectively. Note that some approaches aim to automatically fuse multiple saliency weights/cues via learning algorithms; in our experiments, instead of using such a complex automatic fusion method, we empirically set the integration parameters in (7) and (9).
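
The fusion of (7)–(9) is a simple per-superpixel combination. The sketch below assumes that each weight vector is first normalized to [0, 1] (as suggested by the normalized weights shown in Fig. 2); the parameter values follow the settings above:

```python
import numpy as np

def integrate_weights(w_lc, w_sc, w_bg, w_cb,
                      omega=0.5, phi=0.5, alpha=0.5, beta=0.5, gamma=1.0, lam=0.5):
    """Sketch of the multiplication + summation fusion of Eqs. (7)-(9)."""
    def normalize(w):
        rng = w.max() - w.min()
        return (w - w.min()) / rng if rng > 0 else np.zeros_like(w)

    w_lc, w_sc, w_bg, w_cb = map(normalize, (w_lc, w_sc, w_bg, w_cb))
    s_multi = w_lc * w_sc * w_bg * w_cb                                # Eq. (8)
    s_sum = alpha * w_lc + beta * w_sc + gamma * w_bg + lam * w_cb     # Eq. (9)
    return omega * s_multi + phi * s_sum                               # Eq. (7)
```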

2.4 Saliency smoothing

By integrating the saliency weights, we obtain the saliency map. Based on the observation that neighboring superpixels with coherent color and texture features should have consistent saliency values, we propose to refine the integrated saliency map by performing superpixel-level saliency smoothing. Specifically, the saliency value of a superpixel is set to a weighted average of the saliency values of the other superpixels. When calculating the smoothed saliency value of a superpixel, we consider not only the feature contrast between that superpixel and the other superpixels, but also the spatial distances between them.

For a superpixel f i , the smoothed saliency value S′(f i ) is defined as:

$$ S^{\prime}(f_{i}) = h\left( \frac{{\sum}_{j=1}^{M} \left[ 1-Contr(f_{i},f_{j}) \right] \left[ 1-Dist(f_{i},f_{j}) \right] S(f_{j})}{{\sum}_{j=1}^{M} \left[ 1-Contr(f_{i},f_{j}) \right] \left[ 1-Dist(f_{i},f_{j}) \right]} \right) $$
(10)
$$ h(x) = e^{-\frac{(x - {\omega}_{sm})^{2}}{2 {\sigma}_{sm}^{2}}} $$
(11)

Equation (10) lets superpixels with low feature contrast and small spatial distance to f i contribute more to the smoothed saliency value S′(f i ). Equation (11) is used to normalize the smoothed saliency value S′ computed by (10). In our implementation, ω sm and σ sm are set to 1 and 0.2, respectively.
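
In matrix form, (10)–(11) amount to one weighted average followed by a Gaussian mapping. A minimal sketch, assuming the pairwise feature contrasts and spatial distances have been normalized to [0, 1], is:

```python
import numpy as np

def smooth_saliency(s, contr, dist, omega_sm=1.0, sigma_sm=0.2):
    """Sketch of superpixel-level saliency smoothing, Eqs. (10)-(11).

    s     : (M,) integrated saliency values S(f_i)
    contr : (M, M) pairwise feature contrasts, normalized to [0, 1]
    dist  : (M, M) pairwise spatial distances, normalized to [0, 1]
    """
    # Similar (low contrast) and nearby (small distance) superpixels get large weights.
    affinity = (1.0 - contr) * (1.0 - dist)
    smoothed = (affinity @ s) / affinity.sum(axis=1)                   # inner term of Eq. (10)
    return np.exp(-((smoothed - omega_sm) ** 2) / (2.0 * sigma_sm ** 2))   # h(x), Eq. (11)
```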

Figure 2i shows the results of the smoothed saliency maps. It illustrates that the non-salient noise in the weight combination results (Fig. 2h) is clearly reduced. It should be noted that although the saliency smoothing operation optimizes the integrated saliency map, the quality of saliency detection depends mainly on the four saliency weights introduced above.

3 Experimental results

3.1 Experimental setup

Dataset

To evaluate our approach, we carried out several experiments on three standard benchmark datasets: MSRA [2], SED1 [4], and SED2 [4]. MSRA [2] consists of 1,000 images with different natural scenes and complex backgrounds. SED1 [4] consists of 100 images with low contrast and cluttered backgrounds, making it challenging for saliency detection. SED2 [4] contains 100 images, each with two salient objects. Human-labeled foreground masks are provided as ground truth for salient object detection in the MSRA [2], SED1 [4], and SED2 [4] datasets.

Evaluation criterion

In our experiments, we adopt five criteria to evaluate the quantitative performance of different approaches: receiver operating characteristic (ROC) curve, mean absolute error (MAE) [29], mean precision, mean recall, and F-measure. The ROC curve plots the true positive rate against the false positive rate and presents a robust evaluation of saliency detection performance. Specifically, the ROC curve is obtained by thresholding the saliency map using a series of fixed integers from 0 to 255.

MAE, proposed by [29], provides an estimate of the dissimilarity between the saliency map and the ground truth. It is the mean absolute error between the detected saliency map (S) and the binary ground truth (GT), computed as:

$$ MAE = \frac{{\sum}_{i=1}^{W}{\sum}_{j=1}^{H}\left|S(i,j)-GT(i,j) \right|}{W\times H} $$
(12)

We also use F-measure to evaluate the overall performance. F-measure is computed as:

$$ F_{\gamma} = \frac{(1+\gamma^{2}) \times Precision \times Recall}{\gamma^{2} \times Precision + Recall} $$
(13)

where precision and recall are average values obtained by averaging the precision and recall of the saliency map binarized at a series of thresholds. As described in [2, 25], precision is more important than recall for attention detection; therefore, we use γ 2 = 0.3 to weight precision more than recall.
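
For reference, the two measures can be computed as in the following sketch; the thresholding protocol (fixed thresholds from 0 to 255 on an 8-bit saliency map) follows the description above, though the authors' exact implementation may differ in detail:

```python
import numpy as np

def mae(sal, gt):
    """Mean absolute error between a saliency map and the binary ground truth, Eq. (12)."""
    sal = sal.astype(np.float64)
    sal = sal / 255.0 if sal.max() > 1 else sal    # assume 8-bit maps; rescale to [0, 1]
    return np.abs(sal - gt.astype(np.float64)).mean()

def f_measure(sal, gt, gamma2=0.3, thresholds=range(256)):
    """Average precision/recall over fixed thresholds and the F-measure of Eq. (13)."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in thresholds:
        pred = sal > t
        if pred.sum() == 0 or gt.sum() == 0:       # skip degenerate thresholds
            continue
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / pred.sum())
        recalls.append(tp / gt.sum())
    p, r = np.mean(precisions), np.mean(recalls)
    return (1 + gamma2) * p * r / (gamma2 * p + r)
```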

3.2 Comparison of saliency detection approaches

For convenience, we use MSW (Multiple Saliency Weights) to denote our multiple-weight integration approach. We compare MSW with the following representative saliency detection methods: SR [17], IT [19], SIM [27], SUN [40], AC [1], SeR [33], AIM [9], FT [2], SEG [30], and wCtr [42]. These approaches are typical of saliency detection and are evaluated using either their publicly available source code or the original saliency detection results provided by the authors.

Figure 7 reports the experimental results of all approaches on the SED1, SED2, and MSRA datasets. The results demonstrate the overall better quality of the saliency maps generated by our MSW approach in terms of MAE, ROC, and F-measure. Specifically, Fig. 7 (left column) shows the MAE comparison, which indicates that our MSW approach obtains the lowest MAE scores on the SED1 [4], SED2 [4], and MSRA [2] datasets, except for the wCtr [42] approach on SED2 and MSRA [2]. These results demonstrate that the saliency maps produced by the proposed approach are more consistent with the ground truth.

Fig. 7 MAE, ROC curve, and F-measure performance of all approaches on the SED1 [4], SED2 [4], and MSRA [2] datasets. From top to bottom: the SED1 [4], SED2 [4], and MSRA [2] datasets. From left to right: MAE, ROC curves, and F-measure performance. The experimental results demonstrate the overall better quality of the saliency maps generated by our approach

Figure 7 (middle column) shows the ROC curves of our MSW approach and the other methods. Since our approach takes multiple visual cues into account instead of a single cue, MSW outperforms the competing methods on the SED1 [4], SED2 [4], and MSRA [2] datasets; for a fixed false positive rate, MSW obtains a higher true positive rate than the other saliency detection approaches in most cases.

Furthermore, Fig. 7 (right column) shows the average F-measure performance of our MSW approach and the other methods on the SED1 [4], SED2 [4], and MSRA [2] datasets. The experimental results demonstrate that our MSW approach outperforms the other methods in terms of precision, recall, and F-measure on the three standard benchmark datasets in most cases.

Figure 8 presents visual examples of salient object detection on the MSRA [2] and SED1 [4] datasets for a subjective comparison. It demonstrates that the saliency maps obtained by our approach are visually more pleasing than those of the competing approaches and are closer to the ground truth.

Fig. 8 Visual comparison of saliency detection on the MSRA [2] and SED1 [4] datasets. (a) input images (IM), (b)-(j): saliency maps generated using different approaches, (k) our MSW and (l) MSWS (Multiple Saliency Weight Smoothing) approaches, and (m) ground truths (GT)

In Table 1, we compare the average processing time on SED1 [4] with the other saliency detection approaches mentioned above. The experiments were run on an Intel Xeon CPU at 2.53 GHz with 24 GB of RAM under Windows Server 2008, and all algorithms were implemented in MATLAB. Table 1 shows that the time complexity of our MSW approach is relatively low compared with the other methods.

Table 1 Comparison of processing time (seconds per image)

4 Application

The results of saliency detection can be used to improve existing image processing applications; content-aware image retargeting is a good example. It judiciously retargets an image to a target resolution based on an importance map of the image [5, 28]. We experiment with using different saliency maps in the As-Similar-As-Possible (ASAP) retargeting approach [28]. ASAP, a typical continuous approach proposed by Panozzo et al. [28], optimizes a mapping (warping) from the resolution of the source image to the target resolution using several types of constraints in order to protect the important content of the source image.

Figure 9 shows the retargeting results obtained with different saliency maps: the image gradient, FT [2], IT [19], and our MSW. It demonstrates that the ASAP retargeting approach [28] produces better results when using our saliency maps. This is reasonable because the saliency maps produced by our approach are consistent with the object regions, which is important for importance-map-based retargeting approaches. In contrast, gradient maps often have higher saliency values only at object boundaries, and the saliency maps of FT [2] and IT [19] cover fewer object regions, making them less suitable for image retargeting.

Fig. 9 Image retargeting results (75% of the original width). Comparison of content-aware image retargeting results [28] using the saliency maps of gradient, FT [2], IT [19], and our MSW

5 Conclusion

In this paper, we have presented a novel salient object detection approach that estimates the saliency of regions using four powerful saliency weights, i.e., local contrast weight, superpixel clarity weight, background probability weight, and central bias weight. The final saliency maps are obtained by integrating these saliency weights. Furthermore, we propose a superpixel-level saliency smoothing approach to optimize the integrated results and obtain clean and consistent saliency maps. Extensive experiments on three standard benchmark datasets show that our approach achieves good performance and is computationally efficient. In the future, we aim to exploit extrinsic information (such as images with visually similar content to the original image) to further improve the performance of our algorithm.