1 Introduction

When people look at an image, they are usually attracted by distinct regions called salient objects. The objective of saliency detection is to detect these salient objects in an image. Saliency detection plays an important role in many applications such as object segmentation and recognition [30], image compression [13], image classification [31], and image retrieval [32].

Feature extraction is generally one of the key elements of saliency detection algorithms [11]. According to the features they extract, which mainly include color and texture, existing saliency detection algorithms can be roughly divided into two categories. Algorithms in the first category use only color as the main feature for computing the saliency map [1, 2, 5, 14, 16, 18, 21, 23]. Because they do not consider texture, they usually fail on images with complex textures. Algorithms in the second category use both color and texture features; however, they cannot adaptively combine the two, which usually lowers the precision of the final saliency map.

Differing from the above algorithms, our method uses adaptive squared fusion to combine color and texture cues, adaptively computing the optimum fusion ratio of the color and texture sub-saliency maps according to the texture complexity of the image. Images with complex textures can thus be processed efficiently. A modified RC algorithm [5] is adopted to obtain the color sub-saliency map. First, the original image is segmented into regions using the proposed method, which combines superpixel-based segmentation [3] and the mean-shift algorithm [7] to mitigate over-segmentation and obtain semantic regions. Then, global color contrast is computed over the segmented regions to obtain the color sub-saliency map. Next, Gabor-filtered maps are used to extract texture cues from the original image, and global texture contrast yields the texture sub-saliency map. Finally, the texture and color sub-saliency maps are combined into the final saliency map using the adaptive squared approach. By segmenting before computing contrast, we avoid problems such as unclear boundaries and the highlighting of only borders. Moreover, the proposed segmentation method yields semantic regions instead of fragments that would render the texture cues meaningless. Thus, by incorporating texture cues, we can efficiently handle images with complex textures. Some examples are shown in Fig. 1.

Fig. 1

Color and texture sub-saliency maps of two example images. a, d Input images; b, e color sub-saliency maps; and c, f texture sub-saliency maps

The contributions of this paper are twofold: (1) We present a novel saliency detection method that adopts adaptive squared fusion to combine texture and color cues, adaptively computing the optimum fusion ratio of the color and texture sub-saliency maps according to the texture complexity of the image. By incorporating texture cues into our saliency detection algorithm, we can efficiently handle images with complex textures. (2) We propose an effective pre-segmentation method for images with complex textures. By combining superpixel-based segmentation [3] and the mean-shift algorithm [7], our method can efficiently reduce over-segmentation.

The remainder of this paper is organized as follows. Section 2 reviews some related studies. Section 3 describes the proposed method in detail. Section 4 compares the proposed approach with other state-of-the-art methods. Finally, Section 5 concludes the paper by summarizing our findings.

2 Related work

Most saliency detection methods do not take texture cues into account [1, 2, 5, 14, 16, 18, 21, 23]. Itti et al. [18] developed an algorithm based on a biologically plausible architecture; they detect image saliency using a center-surround operation that imitates the behavior of a visual receptive field. Other, purely computational methods have also been proposed [14-23]. Ma et al. [23] introduced a local-contrast-based fuzzy growing method to detect salient regions in images. Achanta et al. [1] extended the MZ model [23] and made use of color and luminance cues to achieve multi-scale local contrast enhancement. Hou et al. [16] presented a spectral-residual-based method for detecting salient regions: based on the averaged log spectrum over a large number of images, the spectral residual of an image can be extracted and used to construct the saliency map in the spatial domain. Further, Achanta et al. [2] introduced a frequency-tuned method that preserves well-defined boundaries of salient objects by retaining sufficient frequency content of the image. Goferman et al. [14] proposed context-aware saliency detection, which aims to detect salient regions that represent a scene; it differs from previously defined types of saliency detection, which detect only dominant salient objects, but suffers from the same drawbacks, such as unclear boundaries, highlighting of only borders, high computational complexity, and potential failure on images with complex backgrounds. The Fourier transform has also been employed to detect visual saliency [21]. All the methods described above ignore texture cues; therefore, they may not be able to handle images with complex textures.

On the other hand, many computational visual attention models have been proposed. Cheng et al. [5] introduced two robust methods for saliency detection, namely, the histogram-based contrast (HC) method and the region-based contrast (RC) method, which make use of the differences in global contrast and spatial coherence, respectively, to yield the saliency map. Further, they adopted the graph-based method [11] to segment images into numerous regions and then computed their saliency values. The performance of saliency detection is highly dependent on the segmentation result. However, the graph-based segmentation method usually suffers from over-segmentation, which degrades saliency detection performance. Thus, saliency detection may fail in the case of images with cluttered or textured backgrounds. Hou et al. [17] constructed the final saliency map of an image using the image signature. In addition, Perazzi et al. [27] decomposed an image into perceptually homogeneous elements; in this method, local contrast and global contrast are handled in a unified manner. Cheng et al. [6] proposed a new framework that can query images using a simple sketch. Further, Jiang et al. [19] employed the Markov random walk to estimate saliency. However, this approach may fail when the salient objects touch the image boundaries. Moreover, it is difficult to detect objects that are similar to the background in appearance. Recently, Liu et al. [22] proposed a novel saliency detection framework, namely, the saliency tree. In addition, Fang et al. [9] combined depth and texture maps for saliency detection in stereoscopic images.

Some saliency detection methods take texture cues into account [9, 33]. Two important factors affecting the design of such saliency detection models are (1) how to estimate saliency from texture cues and (2) how to combine the saliency detected using texture cues with the saliency detected by methods that do not use texture cues.

Most traditional saliency detection methods do not take texture cues into account. In this work, we attempt to mine the values of texture cues for saliency detection. To this end, we propose a novel saliency detection method that combines color and texture cues. Based on the responses of a Gabor filter, texture features are constructed and used to obtain the texture sub-saliency map. Then, the texture and color sub-saliency maps are combined in a nonlinear manner for robust image saliency detection. An important novelty of our approach is the image segmentation method that combines superpixel segmentation and the mean-shift algorithm to address the problem of over-segmentation. Recently, Wang et al. [33] also proposed a method that combines color and texture cues; however, they employed textons to represent texture cues. Fang et al. [9] also employed texture cues in their work. However, they used the first nine low-frequency AC coefficients to represent the texture feature for each image patch.

3 Proposed saliency detection method

An overview of the proposed saliency detection method is shown in Fig. 2. First, the original image is segmented into many small regions using superpixel segmentation [3]. The advantage of superpixel segmentation is that it efficiently preserves the edges of objects; however, it produces many trivial and tiny regions, which is referred to as over-segmentation. Therefore, the mean-shift algorithm [7] is adopted to cluster these tiny patches into large homogeneous regions. We also tried other clustering methods, such as GB and JSEG, but found that the mean-shift method provides the best result. The motivation for this combination is that it reduces over-segmentation when processing images with complex backgrounds, as compared with other methods. Second, texture and color cues are employed to obtain two saliency maps: the texture sub-saliency map and the color sub-saliency map. An improved version of the RC method [5] based on color contrast is adopted to obtain the color sub-saliency map. Third, the texture features for each region are constructed from the responses of a Gabor filter, and these texture features are used to obtain the texture sub-saliency map. Finally, the color and texture sub-saliency maps are combined to obtain the final saliency map. Figures 2 and 3 show examples of this procedure.
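As a high-level illustration of this pipeline, the following Python sketch chains together the helper functions sketched in the subsections below (pre_segment, color_subsaliency, gabor_responses, texture_subsaliency, and fuse); these names, the placeholder file path, and the way the per-region color saliency is broadcast to pixels are assumptions for illustration, not part of the original implementation.

```python
import numpy as np
from skimage.io import imread
from skimage.color import rgb2lab

def smst_saliency(path="example.jpg"):
    """End-to-end sketch of the proposed pipeline (names are illustrative)."""
    image = imread(path)
    regions = pre_segment(image)                      # Section 3.1: SLIC + mean shift
    lab = rgb2lab(image)
    SC = color_subsaliency(lab, regions)              # Section 3.2: per-region values
    responses = gabor_responses(image)                # Section 3.3: 24 Gabor maps
    ST_map = texture_subsaliency(responses, regions)  # Section 3.3: per-pixel map

    # Broadcast the per-region color saliency to a per-pixel map.
    labels = np.unique(regions)
    SC_map = SC[np.searchsorted(labels, regions)]

    # Per-region Gabor variances V_i are needed for the fusion coefficient (Eq. (9)).
    V = np.array([responses[regions == r].var(axis=0).mean() for r in labels])
    S, alpha = fuse(SC_map, ST_map, V)                # Section 3.4: adaptive fusion
    return S
```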

Fig. 2

General flowchart of the proposed algorithm. First, the original image (a) is segmented into superpixels (b); then, the mean-shift algorithm is used to cluster tiny segmented patches (c); then, Gabor-filtered maps are used to extract the texture cues (d); finally, color (e) and texture (f) sub-saliency maps are obtained and combined into the final saliency map (g)

Fig. 3

Individual maps for each step of the proposed algorithm. a, h Input image; b, i superpixel segmentation result with down-sampled image; c, j final mean-shift segmentation; d, k color sub-saliency map; e, l texture sub-saliency map; f, m final saliency map; and g, n ground truth


3.1 Superpixel-based image pre-segmentation

The proposed image segmentation algorithm combines superpixel segmentation [3] and the mean-shift method [7]. The specific steps are listed in Algorithm 2 (Proposed image segmentation algorithm). In the first step, we segment the input image using the SLIC algorithm [3]; in our experiments, an image of 300 × 400 pixels is segmented into about 3600 superpixels. SLIC leads to over-segmentation, i.e., many tiny and trivial regions in the segmentation result. If the regions are too small, it is difficult to extract meaningful and reliable Gabor features for describing the image content. Therefore, in the second step, the mean-shift approach is adopted to cluster the small regions into larger regions, greatly reducing the over-segmentation problem.
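The following Python sketch illustrates this two-step pre-segmentation under simplifying assumptions: superpixels are produced with scikit-image's SLIC, each superpixel is described only by its mean Lab color, and superpixels are merged by mean-shift clustering of these color descriptors. The original method applies the mean-shift segmentation of [7], which also uses spatial coordinates, so the descriptor choice, the bandwidth value, and the function name pre_segment are illustrative.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.segmentation import slic
from sklearn.cluster import MeanShift

def pre_segment(image, n_superpixels=3600, bandwidth=8.0):
    """Superpixel over-segmentation followed by mean-shift merging (a sketch)."""
    # Step 1: SLIC superpixels (labels start at 0).
    sp = slic(image, n_segments=n_superpixels, compactness=10, start_label=0)
    lab = rgb2lab(image)

    # Step 2: describe every superpixel by its mean Lab color.
    n_sp = sp.max() + 1
    feats = np.zeros((n_sp, 3))
    for i in range(n_sp):
        feats[i] = lab[sp == i].mean(axis=0)

    # Step 3: cluster the superpixel descriptors with mean shift and relabel
    # the image so that similar superpixels are merged into larger regions.
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    cluster_of_sp = ms.fit_predict(feats)
    regions = cluster_of_sp[sp]          # per-pixel region label
    return regions
```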

We compare our algorithm with traditional segmentation methods. Figure 4 shows examples comparing the proposed algorithm with four existing segmentation algorithms: graph-based segmentation [11], mean shift [7], JSeg [24], and segmentation tree [25]. The comparison indicates that the proposed method reduces over-segmentation more effectively than the other methods.

Fig. 4

Comparison of the graph-based, mean-shift, JSeg, and proposed segmentation algorithms. a, g, m Original image; b, h, n segmentation result using the graph-based method [11]; c, i, o segmentation result using the mean-shift method [7]; d, j, p segmentation result using the JSeg method [24]; e, k, q segmentation result using the proposed algorithm; and f, l, r segmented regions of the proposed algorithm, shown in different colors


3.2 Color sub-saliency map

This subsection describes the use of an improved version of the RC method [5] to obtain the color sub-saliency map. Instead of graph-based segmentation [11, 15], the segmentation method described in Section 3.1 is adopted to overcome the over-segmentation problem. A feature vector that records the color frequency is extracted from every segmented region. Then, the color sub-saliency map is produced according to the differences in the color feature vectors.

The color sub-saliency map is computed by the following formula [5]:

$$ SC(r_{k})=\sum\limits_{r_{k}\neq r_{i}}exp(-D_{r}(r_{k},r_{i})/\sigma^{2})w(r_{i})D_{c}(r_{k},r_{i}), $$
(1)

where $SC(r_{k})$ is the saliency value of region $r_{k}$, and $w(r_{i})$ is the weight of region $r_{i}$, equal to the number of pixels in $r_{i}$. $D_{r}(r_{k},r_{i})$ is the spatial distance between regions $r_{k}$ and $r_{i}$, defined as the Euclidean distance between their centers of gravity. $\sigma^{2}$ is a positive coefficient that controls the strength of the spatial weighting; here, $\sigma^{2}$ is set to 0.4, and the coordinates of all pixels are normalized to [0,1]. $D_{c}(r_{k},r_{i})$ is the color distance between $r_{k}$ and $r_{i}$ in Lab space, computed by the following formula [5]:

$$ D_{c}(r_{1},r_{2})=\sum\limits^{n_{1}}_{i=1}\sum\limits^{n_{2}}_{j=1}f(r_{1,i})f(r_{2,j})D(r_{1,i},r_{2,j}), $$
(2)

where $f(r_{k,i})$ is the frequency of the $i$-th color $r_{k,i}$ among all $n_{k}$ colors in the $k$-th segmented region $r_{k}$, and $k \in \{1, 2\}$.
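For illustration only, the following simplified sketch computes Eq. (1) with each region summarized by its mean Lab color, so the color-frequency histogram of Eq. (2) collapses to a single color per region; the names lab, regions, and color_subsaliency, as well as the final normalization, are assumptions carried over from the earlier sketch rather than part of the original method.

```python
import numpy as np

def color_subsaliency(lab, regions, sigma2=0.4):
    """Simplified sketch of Eq. (1); each region is represented by its mean
    Lab color, so D_c of Eq. (2) reduces to a single color distance."""
    h, w = regions.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs / w, ys / h], axis=-1)       # coordinates in [0, 1]

    labels = np.unique(regions)
    mean_col = np.array([lab[regions == r].mean(axis=0) for r in labels])
    centers = np.array([coords[regions == r].mean(axis=0) for r in labels])
    weights = np.array([(regions == r).sum() for r in labels])  # w(r_i)

    sal = np.zeros(len(labels))
    for k in range(len(labels)):
        for i in range(len(labels)):
            if i == k:
                continue
            d_r = np.linalg.norm(centers[k] - centers[i])   # spatial distance
            d_c = np.linalg.norm(mean_col[k] - mean_col[i]) # color distance
            sal[k] += np.exp(-d_r / sigma2) * weights[i] * d_c
    return sal / (sal.max() + 1e-9)                          # per-region values
```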

3.3 Texture sub-saliency map

Gabor filtering [10, 29], an effective method for describing image texture, is widely used in many fields such as vehicle detection [4], face recognition [8, 20, 34], fingerprint verification [26], and palm print recognition [12]. In this study, a new type of feature is extracted from the responses of a Gabor filter [28] in order to obtain the texture sub-saliency map. First, for every pixel in a region output by the segmentation method, the responses of a Gabor filter with four scales and six orientations are computed, yielding a 24-dimensional Gabor feature vector per pixel. Then, for every region, a mean vector and a variance vector are constructed from these Gabor feature vectors and used to measure the saliency of the image regions.
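A minimal sketch of such a filter bank using scikit-image's gabor filter is given below; the paper does not specify the exact scale parameters, so the four frequencies, the use of the response magnitude, and the function name gabor_responses are assumptions.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gabor

def gabor_responses(image, frequencies=(0.05, 0.1, 0.2, 0.4), n_orient=6):
    """Sketch: 4 scales x 6 orientations -> 24 response maps per pixel."""
    gray = rgb2gray(image)
    responses = []
    for f in frequencies:                         # scales s
        for k in range(n_orient):                 # orientations o
            theta = k * np.pi / n_orient
            real, imag = gabor(gray, frequency=f, theta=theta)
            responses.append(np.sqrt(real**2 + imag**2))   # magnitude G_jk(s,o)
    return np.stack(responses, axis=-1)           # H x W x 24 array
```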

For a pixel $P_{jk}$ in the $i$-th region, if the response of a Gabor filter with scale $s$ and orientation $o$ is $G_{jk}(s,o)$, the mean Gabor response $X_{i}(s,o)$ over all the pixels in the $i$-th region is given by

$$ X_{i}(s,o)=\frac{1}{N_{i}}\sum\limits^{r}_{j=1}\sum\limits^{c}_{k=1}G_{jk}(s,o) $$
(3)

where $N_{i}$ is the number of pixels in the $i$-th region, $r$ is the number of rows in the image, and $c$ is the number of columns in the image.

Given the above mean Gabor response, the variance of the Gabor response at scale $s$ and orientation $o$ is computed as follows:

$$ Y_{i}(s,o)=\frac{1}{N_{i}}\sum\limits^{r}_{j=1}\sum\limits^{c}_{k=1}(G_{jk}(s,o)-X_{i}(s,o))^{2} $$
(4)

Because four scales and six orientations are used to generate the Gabor responses, 24 means and 24 variances are obtained. These means and variances are combined to obtain two average values, $M_{i}$ and $V_{i}$, for the $i$-th region.

$$ M_{i}=\frac{1}{24}\sum\limits^{4}_{s=1}\sum\limits^{6}_{o=1}X_{i}(s,o) $$
(5)

where $M_{i}$ is the mean of $X_{i}(s,o)$ over the four scales and six orientations.

$$ V_{i}=\frac{1}{24}\sum\limits^{4}_{s=1}\sum\limits^{6}_{o=1}Y_{i}(s,o) $$
(6)

where $V_{i}$ is the mean of $Y_{i}(s,o)$ over the four scales and six orientations. The texture cue of the $i$-th image region is thus represented by $M_{i}$ and $V_{i}$, computed by formulas (5) and (6), respectively. Then, for the $i$-th region in the segmented image, the texture saliency value $ST(i)$ is defined as

$$ ST(i)=\sum\limits^{NR}_{j=1,j\neq i}\frac{N_{j}}{N_{i}}\times((M_{i}-M_{j})^{2}+(V_{i}-V_{j})^{2}) $$
(7)

where $NR$ is the total number of regions in the segmented image, and $N_{i}$ and $N_{j}$ denote the numbers of pixels in the $i$-th and $j$-th regions, respectively. From a psychological perspective, small regions are more likely to be salient. Therefore, the numbers of pixels are taken into account in the above equation so that small regions can attain high saliency values.

All the pixels in the same region are assigned an identical saliency value, namely the saliency value of the region. Thus, it is easy to obtain the texture sub-saliency map, whose gray values are the saliency values of the pixels. An example of the texture sub-saliency map is shown in Fig. 5.
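A sketch of Eqs. (3)–(7) is given below, assuming responses is the H×W×24 array produced by the gabor_responses sketch above and regions is the label map from pre_segment; the normalization of ST and the broadcasting of region saliency to pixels follow the description above, while the epsilon guard is an illustrative detail.

```python
import numpy as np

def texture_subsaliency(responses, regions):
    """Sketch of Eqs. (3)-(7): per-region Gabor statistics and their contrast."""
    labels = np.unique(regions)
    n = len(labels)
    sizes = np.array([(regions == r).sum() for r in labels])   # N_i

    # Eqs. (3)-(6): per-region mean and variance of every (s, o) response,
    # averaged over the 24 scale/orientation pairs.
    M = np.zeros(n)
    V = np.zeros(n)
    for idx, r in enumerate(labels):
        vals = responses[regions == r]            # (N_i, 24)
        X = vals.mean(axis=0)                     # X_i(s, o)
        Y = vals.var(axis=0)                      # Y_i(s, o)
        M[idx] = X.mean()                         # Eq. (5)
        V[idx] = Y.mean()                         # Eq. (6)

    # Eq. (7): size-weighted contrast of the (M_i, V_i) texture descriptors.
    ST = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            ST[i] += sizes[j] / sizes[i] * ((M[i] - M[j])**2 + (V[i] - V[j])**2)
    ST /= ST.max() + 1e-9

    # Every pixel inherits the saliency value of its region.
    return ST[np.searchsorted(labels, regions)]
```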

Fig. 5

General flowchart of texture cue calculation. a Original image; b 24 Gabor-filtered maps; c 24 maps with different $s$ and $o$ with average value $X_{i}$; d average map with average value $M_{i}$; e 24 maps with different $s$ and $o$ with variance $Y_{i}$; f average map with variance $V_{i}$; and g final texture sub-saliency map $ST(i)$


3.4 Adaptive squared fusion of the color and texture sub-saliency maps

Finally, the color and texture sub-saliency maps are combined in a nonlinear manner to obtain the final saliency map. For a pixel at position $(i,j)$, the saliency value combining the color and texture cues is given by

$$ S_{ij}=r*\sqrt{(1-\alpha)SC_{ij}^{2}+\alpha \times ST_{ij}^{2}} $$
(8)

where $r$ is a constant (set to 1.5 here) and $SC_{ij}$ is the color saliency value of the pixel. Further, $ST_{ij}$ is the texture saliency value of the pixel, and $\alpha$ is the adaptive fusion coefficient given by

$$ \alpha =\frac{1}{T}\times exp(\frac{V_{max}}{K}-1) $$
(9)

where

$$ V_{max}=max(V_{1},V_{2}...V_{NR}) $$
(10)

Note that $V_{max}$ is the maximum of $V_{1},V_{2},\ldots,V_{NR}$ and measures the complexity of the image texture. A high value of $V_{max}$ implies that the image contents are extremely cluttered; in such cases, the texture cues contribute significantly toward the detection of salient objects, so a high fusion coefficient is expected. In our experiments, good performance was achieved by setting $T = 5$ and $K = 1600$. Some examples with different α values are shown in Fig. 6.
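A minimal sketch of Eqs. (8)–(10) is given below; V is assumed to hold the per-region variances $V_{i}$ from Eq. (6), and because their magnitude depends on how the Gabor responses are scaled, the constant K = 1600 may need adjusting in practice. The clipping of the fused map is an illustrative detail.

```python
import numpy as np

def fuse(SC_map, ST_map, V, r=1.5, T=5.0, K=1600.0):
    """Sketch of Eqs. (8)-(10): adaptive squared fusion of the color (SC)
    and texture (ST) sub-saliency maps."""
    V_max = np.max(V)                              # Eq. (10)
    alpha = (1.0 / T) * np.exp(V_max / K - 1.0)    # Eq. (9): fusion coefficient
    S = r * np.sqrt((1.0 - alpha) * SC_map**2 + alpha * ST_map**2)  # Eq. (8)
    return np.clip(S, 0.0, 1.0), alpha
```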

Fig. 6

Saliency results with different α values. a Input image, b α = 0.1, c α = 0.2, d α = 0.3, e α = 0.4, and f α = 0.5. The adaptive rate (α) of the input image is 0.103


4 Experimental results

We evaluate the effectiveness of our approach on the data set employed by Achanta et al. [2]. This data set contains 1000 images and is widely used for saliency detection. Among these 1000 images, we selected 141 images that contain abundant textures and used them to compare the proposed algorithm with other state-of-the-art methods, including HC [5], RC [5], IT [18], MZ [23], FT [2], and HFT [21].

Some examples of the comparison results are shown in Fig. 7. In the gray-scale saliency maps, the higher the saliency value of a pixel, the lighter the pixel. As shown in the figure, with the MZ algorithm [23], the pixels along the edges usually have high saliency values, whereas the pixels within the salient objects have low saliency values. The IT method detects only some fragments of the salient objects. Further, it is difficult to distinguish the salient objects from the background when using the HC and FT methods. The RC method [5] splits the salient objects into several parts, and in the HFT method [21], the borders are unclear. In contrast, the proposed method detects the salient objects as a whole and locates their edges accurately. Thus, the proposed method outperforms the other methods, and the incorporation of texture cues is a crucial factor in this improvement.

Fig. 7

Results of saliency detection using different methods. a Original images, b IT [18], c MZ [23], d FT [2], e HC [5], f RC [5], g HFT [21], h our SMST algorithm without texture cues, i our SMST algorithm, j detected mask, and k ground truth

For the dataset used in this study, the ground truths of the salient objects, i.e., their locations, are also provided. Thus, it is easy to judge which pixels in each image belong to the salient objects. The performances of the different methods can be compared on the basis of precision-recall curves. A threshold on the saliency map transforms it into a black-and-white (i.e., binary) image in which the white pixels are regarded as parts of the salient object and the black pixels as background. Precision is defined as the proportion of detected salient pixels that truly belong to the salient object, and recall is defined as the ratio of the number of correctly detected salient pixels to the total number of ground-truth salient pixels. The precision and recall curves for our method and the other methods are shown in Fig. 8a. From these curves, it is easy to conclude that our approach outperforms the other methods and that the incorporation of texture cues is a crucial feature of our approach. Recalls at various thresholds are also compared in Fig. 8b; in general, the recalls of the proposed method are greater than those of the other methods at the same thresholds.

Fig. 8

Performance comparison between our saliency detection algorithm and other methods. a Precision-recall curves, b recall curves, and c bar graph

The $F_{\beta}$ score is another index commonly employed to measure saliency detection performance. This score combines two indexes, recall and precision; it is defined as

$$ F_{\beta}=\frac{(1+\beta^{2})Precision\times Recall}{\beta^{2}\times Precision+Recall} $$
(11)

The parameter $\beta^{2}$ controls the contribution of precision to the score; in our experiments, $\beta^{2}$ was set to 0.3. When both precision and recall are high, the $F_{\beta}$ score is high. The three indexes (i.e., precision, recall, and $F_{\beta}$) averaged over different thresholds are compared for our method and the other methods in Fig. 8c. Our approach clearly outperforms the other methods in terms of these indexes. Compared to the RC algorithm, the proposed method uses additional information from the texture cues, and the comparison results in Fig. 8 show that our method outperforms the RC algorithm. Furthermore, the experimental results confirm the importance of texture cues for saliency detection.
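For concreteness, a minimal sketch of this evaluation protocol is shown below: the saliency map is binarized at a threshold, compared with the ground-truth mask, and scored with Eq. (11); the epsilon guards and the idea of sweeping the threshold to produce the curves of Fig. 8 are illustrative choices.

```python
import numpy as np

def precision_recall_fbeta(saliency, gt_mask, threshold, beta2=0.3):
    """Sketch: binarize the saliency map and compare with the ground truth."""
    pred = saliency >= threshold
    tp = np.logical_and(pred, gt_mask).sum()          # correctly detected pixels
    precision = tp / (pred.sum() + 1e-9)              # correct among detected
    recall = tp / (gt_mask.sum() + 1e-9)              # detected among salient
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-9)
    return precision, recall, f_beta                  # Eq. (11)

# Sweeping the threshold over the saliency range yields the PR curve of Fig. 8a.
```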

In Fig. 8, the precision and F-measure of SMST (precision = 88 %, recall = 83 %, F-measure = 86 %) are higher than those of RC (precision = 82 %, recall = 87 %, F-measure = 83 %), although the recall of RC is slightly higher. As a result, we can obtain better salient objects from the saliency maps of SMST than from those of RC when using the segmentation method proposed by Achanta et al. [2].

5 Conclusion

We proposed a novel method for detecting salient objects in images with complex textures. The proposed method uses color cues to obtain the color sub-saliency map of an image. Based on the responses of a Gabor filter, texture features are extracted from the image in order to obtain the texture sub-saliency map. Then, the color and texture sub-saliency maps are combined to obtain the final saliency map. The results of our experiments showed that the proposed method outperforms other state-of-the-art methods.