1 Introduction

Searching for and recognizing a particular known object in a scene can be extremely complex when one has to consider all possible views the object can take. The human visual system employs attention to limit the amount of information that needs to be processed, speeding up search and recognition and allowing complex scenes to be interpreted in real time [46]. Predicting the locations where people are likely to look has many real-world applications. Computational models can be applied to various computer vision tasks such as navigational assistance [11], robot control [6], surveillance systems [18], object detection and recognition [26], and scene understanding [23]. Such predictions also find applications in other areas including adaptive image and video compression [10], pictorial database querying [7] and content-aware image editing [41].

Visual saliency originates from visual uniqueness, unpredictability, or rarity, and is often attributed to variations in image features such as color, gradient, edges and boundaries. Although many factors may determine which image features are selected or discarded by our attentional processes, visual saliency can generally be categorized into two groups: bottom-up and top-down [9]. The former comprises data-driven (instantaneous) processes, while the latter comprises processes that depend on the organism's internal state, such as the visual task at hand or the subject's background [29].

Because bottom-up saliency is important for many practical applications, we focus on bottom-up, data-driven saliency in this paper. The fast, parallel, pre-attentive, bottom-up stage of human vision is thought to guide a serial (computationally intensive), attentive, top-down stage. Among all features that contribute to image saliency, orientation and color are thought to be the most significant [22, 40]. Consequently, most current computational saliency models are based on color or orientation contrast (e.g. [8, 23, 39]). However, in color science, chromaticity determines the quality of a color regardless of its luminance, and it was reported in [35] that humans are more sensitive to chromaticity cues than to luminance cues. Moreover, there is evidence that the human visual system combines low-level features at an early stage [2], and information theory can be used to compute saliency by combining chromaticity and contrast at such an early stage.

Therefore, in this paper, we propose a method that computes image saliency from a color-saliency-boosted version of the original image. The method is based on the key observation that in natural images, color transitions of equal probability (i.e. iso-salient transitions) form ellipsoids in decorrelated color spaces [36]. The transformation that turns these ellipsoidal surfaces into spherical ones (called the color saliency boosting function) ensures that vectors of equal length have equal information content and thus equal impact on the saliency function. In [36], the color-boosted saliency map was computed by a simple gradient-based scheme. In this paper, we further investigate how to obtain a more accurate color-boosted saliency map for predicting human fixations.

It is widely believed that human cortical cells may be hard-wired to preferentially respond to high-contrast stimuli in their receptive fields [31]. Computational saliency models have been shown to successfully model human saliency by determining the contrast of image regions to their surroundings [44], using feature attributes such as intensity, color [8], and edges [1]. Inspired by the recently released global-contrast-based visual saliency detection method [8], we use region-based dissimilarity to produce the final color-boosted saliency map. As in [8], the saliency value of a region is calculated using a global contrast score, measured by the region's contrast and spatial distance to the other regions in the image. The main difference between our method and the global regional contrast method [8] is that we use the color-boosted image instead of the original image as input. In addition, to further improve the results, we initialize the segmentation using a modified version of the graph-based image segmentation method [45], and design a spatial weighting term that describes a bias toward the image center.

The main contribution of this paper is to combine the respective merits of color saliency boosting and global region-based contrast schemes to achieve more accurate saliency maps. We have also extensively evaluated our method on four publicly available benchmark datasets provided by Bruce et al. [5], Judd et al. [24], Linde et al. [37] and Achanta et al. [1], and compared it with eight state-of-the-art saliency methods [1, 5, 8, 12, 19, 21, 23, 44]. The experiments show consistent improvements over previous methods on all datasets, indicating the robustness and generality of the proposed method.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 outlines the color saliency boosting algorithm used in this paper. The regional contrast based saliency detection approach is presented in Section 4, and comprehensive experimental results are given in Section 5. Section 6 demonstrates how the extracted saliency map can be used for image classification. Finally, conclusions are drawn in Section 7.

2 Related work

2.1 Color distinctiveness

Information theory has been successfully applied to model human visual attention from image features [25]. The theory states that feature saliency is inversely related to feature occurrence, i.e. rare features are more informative and thereby more salient than features that occur frequently. Consequently, recent models for predicting human fixation behavior assume that saliency-driven free viewing corresponds to maximizing information sampling [14]. Such models have been successfully applied to model fixation behavior and saliency asymmetries, and even to solve classic computer vision problems such as dynamic background subtraction [14].

Based on the information content of color image derivatives, van de Weijer et al. [36] proposed a so-called color saliency boosting algorithm to exploit the color distinctiveness of images. The key observation behind their method was that in natural images, color transitions of equal probability (i.e. iso-salient transitions) form ellipsoids in decorrelated color spaces. In [36], color saliency boosting was designed as a generic method that is easily adaptable to existing feature detectors. As a result, it has been successfully applied to image retrieval [33] and image classification [27].

2.2 Contrast based visual saliency detection

We focus on literature addressing parallel, pre-attentive, bottom-up saliency detection, which may be biologically motivated, purely computational, or a combination of both. In general, such methods can be divided into two groups: local and global schemes.

Local contrast based methods measure the rarity of image regions with respect to (small) local neighborhoods. Among local contrast based models of saliency, the model of Itti et al. [23] was one of the most influential: it sums the scale-space center-surround excitation responses of feature maps at different spatial frequencies and orientations and feeds the results into a neural network to output the final saliency map. Seo and Milanfar [32] used a self-resemblance mechanism to compute saliency, where a region whose curvature is dissimilar to its surroundings is considered highly salient. Bruce and Tsotsos [5] quantified saliency at a location by the self-information of that location with respect to its surrounding context, either localized pixel regions or the entire image. More recently, Goferman et al. [16] simultaneously modeled local low-level cues (such as color and contrast), global considerations, visual organizational rules, and high-level factors (e.g. human faces) to detect the important parts of the scene. However, such local contrast based methods have an obvious disadvantage: they tend to produce higher saliency values near edges instead of uniformly highlighting salient objects.

Global contrast based methods compute the saliency of an image region using its contrast with respect to the entire image. This global rarity principle agrees better with human intuition and has been implemented in different ways. Frequency based methods perform the analysis in the spectral domain [1]. Achanta et al. [1] proposed a frequency-tuned method that directly computes pixel saliency from a pixel's color difference to the average image color. In [28], low spatial variance of a feature was taken to indicate high saliency, which makes the overly strict assumption that the background everywhere is dissimilar to the object. Zhai and Shah [44] computed pixel saliency based on a pixel's contrast to all other pixels. However, for efficiency, they used only luminance information, ignoring distinctiveness cues in the other channels. More recently, Cheng et al. [8] presented a simple but efficient algorithm that defines pixel-level saliency using global contrast differences.

3 Color saliency boosting

We first provide a high-level description of our model; the implementation details are given in the following sections. To exploit the color distinctiveness of the image, our method begins by boosting color saliency on the original input image. The corresponding saliency map is then extracted using a global region-based contrast scheme. A flow chart describing this process is given in Fig. 1.

Fig. 1 A high-level overview of the proposed method

The color saliency boosting algorithm of van de Weijer et al. [36] provides an efficient way to exploit the saliency of color edges based on information theory. The algorithm is inspired by the notion that a feature's saliency reflects its information content, as described below.

Consider a color image f = (R, G, B). According to information theory, the information content I of a first-order directional derivative f_d is given by the negative logarithm of its probability p:

$$ I\left( {{f_d}} \right) = - \log \left( {p\left( {{f_d}} \right)} \right) $$
(1)

where p(f_d) is the probability of the spatial derivative, which can be estimated from a large image database, e.g. the Corel database of 40,000 images. Therefore, color image derivatives that are equally frequent, hereafter called iso-salient derivatives, have equal information content.
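As a minimal illustration of Eq. 1, the C++ sketch below (our own simplification, not part of the original work) estimates the information content of a derivative from an empirical histogram. Note that the paper's p(f_d) is defined over full color-derivative vectors estimated from a large image collection; this one-dimensional version only conveys the -log(p) relationship, and all names are hypothetical.

// Minimal sketch of Eq. 1: I(f_d) = -log(p(f_d)), with p estimated from an
// empirical histogram of derivative magnitudes (a 1-D simplification of the
// paper's distribution over color-derivative vectors).
#include <algorithm>
#include <cmath>
#include <vector>

double informationContent(double derivMagnitude,
                          const std::vector<double>& histogram, // raw bin counts
                          double binWidth) {
    double total = 0.0;
    for (double c : histogram) total += c;
    size_t bin = std::min(static_cast<size_t>(derivMagnitude / binWidth),
                          histogram.size() - 1);
    // Laplace smoothing avoids -log(0) for derivatives never seen in the database.
    double p = (histogram[bin] + 1.0) / (total + histogram.size());
    return -std::log(p);  // rare derivatives carry more information, hence more saliency
}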

In [36], the authors concluded that a color saliency boosting function g(·) is required that maps iso-salient derivatives to equal saliency. Our goal is to model the surface of iso-salient derivatives with an ellipsoid. We note that the influence of luminance changes on chromatic changes in the image is reflected not only in first-order derivatives but also in higher-order ones. Therefore, we use the color tensor [43] to estimate the parameters of the ellipsoid:

$$ G = \left[ \begin{matrix} R_x^2 + G_x^2 + B_x^2 & R_x R_y + G_x G_y + B_x B_y \\ R_x R_y + G_x G_y + B_x B_y & R_y^2 + G_y^2 + B_y^2 \end{matrix} \right] $$
(2)

where R_x, R_y, G_x, G_y, B_x and B_y are the first-order derivatives of the R, G and B channels along the horizontal and vertical directions, respectively. This definition can be seen as a simple extension of the second moment matrix to color, and it has been successfully used to extend first-order operators to color.

Applying the singular value decomposition [38], the matrix G can be decomposed as G = UΛV^T, where the columns of U are the left singular vectors of G. This yields the color saliency boosting function g:

$$ g(f) = U{\Lambda^{{ - 1/2}}}{V^T}f $$
(3)

Note that, due to its linearity, the color saliency boosting function g can be applied directly to the original image as a preprocessing operation, as shown in Fig. 1. To evaluate the impact of our method, we conducted edge detection experiments on three different inputs: the original image, the color-boosted image obtained with [36], and the color-boosted image obtained with our method. Figure 2 illustrates three examples from the dataset of [1] for edge detection using the Canny detector. It is clear that RGB edges are more biased by luminance, and that the color-boosted edges produced by our method are more visually apparent than those based on [36]. A quantitative evaluation is given in Table 5.

Fig. 2 Comparison of edge detection using the Canny detector. From top to bottom: original images, RGB color edges, color-boosted edges based on [36], and color-boosted edges based on our method
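To make this preprocessing step concrete, the following is a minimal C++ sketch, under our own assumptions rather than the authors' implementation, of the whitening idea behind Eqs. 2 and 3 in the formulation of [36]: estimate the second-order statistics of the RGB derivatives and transform them so that iso-salient derivatives receive equal norm. For compactness the sketch whitens a 3 × 3 derivative covariance through a Cholesky factor instead of the SVD of Eq. 3; the two transforms differ only by a rotation, which leaves derivative norms, and hence their saliency contribution, unchanged. All type and function names are hypothetical.

// Sketch of color saliency boosting as whitening of RGB derivatives (after [36]).
// Assumes derivatives are roughly zero-mean and that at least one is provided.
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// Cholesky factorization Sigma = L * L^T (Sigma symmetric positive definite).
Mat3 cholesky(const Mat3& S) {
    Mat3 L{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j <= i; ++j) {
            double s = S[i][j];
            for (int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = (i == j) ? std::sqrt(s) : s / L[j][j];
        }
    return L;
}

// Solve L * y = x by forward substitution, i.e. y = L^{-1} x (the whitening map).
Vec3 whiten(const Mat3& L, const Vec3& x) {
    Vec3 y{};
    for (int i = 0; i < 3; ++i) {
        double s = x[i];
        for (int k = 0; k < i; ++k) s -= L[i][k] * y[k];
        y[i] = s / L[i][i];
    }
    return y;
}

// Boost color saliency: estimate the covariance of the color derivatives, then
// whiten them so that equally probable (iso-salient) derivatives get equal norm.
std::vector<Vec3> boostDerivatives(const std::vector<Vec3>& derivs) {
    Mat3 S{};  // covariance, assuming (approximately) zero-mean derivatives
    for (const Vec3& d : derivs)
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) S[i][j] += d[i] * d[j];
    for (auto& row : S)
        for (double& v : row) v /= static_cast<double>(derivs.size());
    const Mat3 L = cholesky(S);
    std::vector<Vec3> boosted;
    boosted.reserve(derivs.size());
    for (const Vec3& d : derivs) boosted.push_back(whiten(L, d));
    return boosted;
}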

4 Region based contrast

Humans tend to focus on image regions that contrast strongly with their neighboring regions [43]. As in [8], two factors are considered when evaluating region-based saliency: the dissimilarity between image regions and their spatial distance. As the spatial distance between two regions increases, the influence of their dissimilarity decreases. Furthermore, the distance of each region from the image center is introduced into the spatial weighting term to account for the central bias [34], which was ignored in [8]. In this paper, we first segment the entire image into regions, then calculate color contrast at the region level, and measure the saliency of each region as the weighted sum of its contrast to all other regions in the image. For each region, the corresponding spatial weighting term combines three factors: the size of the region, the spatial distances between this region and all other regions, and the bias of this region toward the image center.

Due to its simplicity and efficiency, we initialize the segmentation using a modified version of graph-based image segmentation [45], which overcomes some drawbacks of [13], according to the following formulas:

$$ T(C) = \frac{W_{\max} - W_{\min}}{\left| C \right|} \cdot \frac{Num_c}{k} $$
(4)
$$ Int(C) = \frac{1}{N}\sum\nolimits_{e \in MST\left( C,E \right)} W(e) $$
(5)

where T(C) is the threshold function, W_max and W_min are the largest and smallest edge weights in the weighted graph, |C| is the size of component C in pixels, Num_c is the number of components in the image, the parameter k can be regarded as the expected number of components (a larger k produces more components), Int(C) denotes the internal difference, MST(C,E) denotes the minimum spanning tree of C, W(e) is the weight of an edge in MST(C,E), and N is the number of edges in MST(C,E). We briefly review this segmentation algorithm. Initially, a graph is constructed over the entire image, with each pixel p forming its own component. Components are then merged by traversing the edges in order of increasing weight and testing whether the edge weight is smaller than the internal difference of both components incident to the edge. If so, the components are merged and the internal difference of the merged component is updated.
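To make the merging procedure concrete, the sketch below implements one possible reading of Eqs. 4 and 5 with a union-find structure: edges are processed in order of increasing weight, and two components are merged when the edge weight does not exceed Int(C) + T(C) for both of them (the standard merge test of graph-based segmentation). Treating Num_c as the current number of components, the precomputed edge list, and all names are our own assumptions; this is not the authors' code.

// Sketch of the modified graph-based segmentation (Eqs. 4-5). Assumes a
// non-empty, precomputed list of weighted edges between neighboring pixels.
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int a, b; double w; };

struct Components {
    std::vector<int> parent, size;
    std::vector<double> intSum;   // sum of merge-edge weights (the MST edges of C)
    std::vector<int> intCount;    // number of those edges
    int count;                    // current number of components (Num_c)
    explicit Components(int n)
        : parent(n), size(n, 1), intSum(n, 0.0), intCount(n, 0), count(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    // Int(C): mean MST edge weight of component C (Eq. 5); zero for single pixels.
    double internal(int c) const { return intCount[c] ? intSum[c] / intCount[c] : 0.0; }
    void merge(int a, int b, double w) {
        intSum[a] += intSum[b] + w;
        intCount[a] += intCount[b] + 1;
        size[a] += size[b];
        parent[b] = a;
        --count;
    }
};

std::vector<int> segmentGraph(int numPixels, std::vector<Edge> edges, double k) {
    double wMin = edges.front().w, wMax = edges.front().w;
    for (const Edge& e : edges) { wMin = std::min(wMin, e.w); wMax = std::max(wMax, e.w); }
    std::sort(edges.begin(), edges.end(),
              [](const Edge& x, const Edge& y) { return x.w < y.w; });
    Components comp(numPixels);
    auto threshold = [&](int c) {  // T(C), Eq. 4 (Num_c read as the current count)
        return ((wMax - wMin) / comp.size[c]) * (comp.count / k);
    };
    for (const Edge& e : edges) {
        int a = comp.find(e.a), b = comp.find(e.b);
        if (a == b) continue;
        if (e.w <= comp.internal(a) + threshold(a) &&
            e.w <= comp.internal(b) + threshold(b))
            comp.merge(a, b, e.w);
    }
    std::vector<int> label(numPixels);
    for (int p = 0; p < numPixels; ++p) label[p] = comp.find(p);
    return label;
}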

After segmentation, we calculate region contrast by color histogram comparison. To build this color histogram efficiently, we adopt the sparse color histogram representation proposed in [8]. In [8], each color channel was quantized into 12 values, reducing the number of colors to 12³ = 1728. Based on the key observation that the colors in a natural image typically cover only a small portion of the full color space, the authors of [8] further reduced the number of colors by keeping the most frequently occurring colors that cover more than 95% of the image pixels and discarding the rest. The final sparse color histogram contains around 85 colors; the remaining pixels, which comprise fewer than 5% of the image pixels, are replaced by the closest colors in the histogram. Figure 3 shows an input image (left) after applying this color reduction procedure: the quantized image (right) uses only 85 histogram bin colors and still retains sufficient visual quality for saliency detection.

Fig. 3 The color quantization result for an input image
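The sparse histogram construction can be sketched as follows. This is a simplified reading of [8] under our own assumptions (8-bit channels, nearest-color replacement measured directly in the quantized RGB space), not the authors' implementation, and the names are hypothetical.

// Sketch of the sparse color histogram of [8]: quantize each RGB channel into
// 12 levels, keep the most frequent colors covering >= 95% of the pixels, and
// map every remaining pixel to its nearest kept color.
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

struct RGB { int r, g, b; };                         // 8-bit channel values assumed

int quantizeChannel(int v) { return v * 12 / 256; }  // maps 0..255 to 0..11

std::vector<int> buildSparsePalette(const std::vector<RGB>& pixels,
                                    std::vector<int>& pixelLabels) {
    // Histogram over the 12^3 = 1728 quantized colors.
    std::map<int, int> hist;
    std::vector<int> quant(pixels.size());
    for (size_t i = 0; i < pixels.size(); ++i) {
        int q = (quantizeChannel(pixels[i].r) * 12 + quantizeChannel(pixels[i].g)) * 12 +
                quantizeChannel(pixels[i].b);
        quant[i] = q;
        ++hist[q];
    }
    // Keep the most frequent colors until they cover at least 95% of the pixels.
    std::vector<std::pair<int, int>> byFreq(hist.begin(), hist.end());
    std::sort(byFreq.begin(), byFreq.end(),
              [](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                  return a.second > b.second;
              });
    std::vector<int> palette;
    size_t covered = 0;
    for (const std::pair<int, int>& bin : byFreq) {
        palette.push_back(bin.first);
        covered += bin.second;
        if (covered * 100 >= pixels.size() * 95) break;
    }
    // Replace every pixel color by its nearest palette color (distance measured
    // in the quantized RGB space for simplicity).
    auto dist2 = [](int a, int b) {
        int ar = a / 144, ag = (a / 12) % 12, ab = a % 12;
        int br = b / 144, bg = (b / 12) % 12, bb = b % 12;
        return (ar - br) * (ar - br) + (ag - bg) * (ag - bg) + (ab - bb) * (ab - bb);
    };
    pixelLabels.resize(pixels.size());
    for (size_t i = 0; i < pixels.size(); ++i) {
        size_t best = 0;
        for (size_t j = 1; j < palette.size(); ++j)
            if (dist2(quant[i], palette[j]) < dist2(quant[i], palette[best])) best = j;
        pixelLabels[i] = static_cast<int>(best);
    }
    return palette;
}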

By integrating the dissimilarity between regions (Dissimilarity), the region size weight ω_2, the spatial distance weight ω_3 and the central bias weight ω_1, we measure the saliency of a region r_p with respect to all other regions in the image as:

$$ Saliency\left( r_p \right) = \omega_1\left( r_p \right)\sum\limits_{r_q \ne r_p} \omega_2\left( r_q \right)\,\omega_3\left( r_p, r_q \right)\,Dissimilarity\left( r_p, r_q \right) $$
(6)

where ω_2(r_q) is the weight of region r_q; we use the number of pixels in r_q as this weight in order to emphasize color contrast to larger regions. The weight ω_1(r_p) is the first weighting mechanism we propose. It indicates the bias of region r_p toward the image center and is defined as:

$$ \omega_1\left( r_p \right) = \frac{1 - BiasToCenter\left( r_p \right)}{D} $$
(7)

where BiasToCenter(r_p) is the Euclidean distance between the centroid of region r_p and the center of the image, and D = max_q{BiasToCenter(r_q)} is a normalization factor.

In Eq. 6, the spatial weight ω_3(r_p, r_q) increases the influence of closer regions and decreases the influence of farther regions, reflecting biologically plausible behavior. We calculate it as:

$$ \omega_3\left( r_p, r_q \right) = \exp\left( -\frac{D_s\left( r_p, r_q \right)}{\sigma_s^2} \right) $$
(8)

where D_s(r_p, r_q) is the spatial distance between regions r_p and r_q, computed as the Euclidean distance between their centroids, and σ_s controls the strength of the spatial weighting. Larger values of σ_s reduce the effect of the spatial weighting, so that contrast to farther regions contributes more to the saliency of the current region. In our experiments, σ_s² was empirically set to 0.4.

The term Dissimilarity(r_p, r_q) in Eq. 6 is the dissimilarity between regions r_p and r_q, which we define according to the color distance between them as:

$$ Dissimilarity\left( r_p, r_q \right) = \sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n} f\left( c_{p,i} \right)\, f\left( c_{q,j} \right)\, D\left( c_{p,i}, c_{q,j} \right) $$
(9)

where f(c_{p,i}) is the probability of the i-th color c_{p,i} among all m colors in region r_p, f(c_{q,j}) is the probability of the j-th color c_{q,j} among all n colors in region r_q, and D(c_{p,i}, c_{q,j}) denotes the color distance between c_{p,i} and c_{q,j}. The probability of a color in the probability density function (i.e. the normalized color histogram) of the region is used as the weight for that color in order to emphasize the color differences between dominant colors.
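Putting Eqs. 6-9 together, the sketch below computes the saliency of every region. It assumes regions are already given as centroids plus normalized color histograms, normalizes spatial distances by the image diagonal so that σ_s² = 0.4 is meaningful, and applies Eq. 7 exactly as printed; these choices and all names are our own assumptions, not the authors' implementation.

// Sketch of the region contrast saliency of Eqs. 6-9.
#include <algorithm>
#include <cmath>
#include <vector>

struct Color { double r, g, b; };

struct Region {
    std::vector<Color> colors;   // histogram bin colors present in this region
    std::vector<double> freq;    // normalized frequencies f(c), summing to 1
    double cx, cy;               // centroid (pixels)
    int numPixels;               // region size, used as the weight omega_2
};

// Eq. 9: histogram-weighted color distance between two regions.
double dissimilarity(const Region& p, const Region& q) {
    double d = 0.0;
    for (size_t i = 0; i < p.colors.size(); ++i)
        for (size_t j = 0; j < q.colors.size(); ++j) {
            const Color& a = p.colors[i];
            const Color& b = q.colors[j];
            double cd = std::sqrt((a.r - b.r) * (a.r - b.r) +
                                  (a.g - b.g) * (a.g - b.g) +
                                  (a.b - b.b) * (a.b - b.b));
            d += p.freq[i] * q.freq[j] * cd;
        }
    return d;
}

// Eqs. 6-8: saliency of every region, combining size (omega_2), spatial
// distance (omega_3, Eq. 8) and central bias (omega_1, Eq. 7).
std::vector<double> regionSaliency(const std::vector<Region>& regions,
                                   double imgW, double imgH,
                                   double sigmaS2 = 0.4) {        // sigma_s^2
    const double diag = std::hypot(imgW, imgH);                   // distance normalizer (assumed)
    std::vector<double> bias(regions.size());
    double D = 1e-12;                                             // max bias, Eq. 7
    for (size_t p = 0; p < regions.size(); ++p) {
        bias[p] = std::hypot(regions[p].cx - imgW / 2.0,
                             regions[p].cy - imgH / 2.0) / diag;
        D = std::max(D, bias[p]);
    }
    std::vector<double> sal(regions.size(), 0.0);
    for (size_t p = 0; p < regions.size(); ++p) {
        double sum = 0.0;
        for (size_t q = 0; q < regions.size(); ++q) {
            if (q == p) continue;
            double ds = std::hypot(regions[p].cx - regions[q].cx,
                                   regions[p].cy - regions[q].cy) / diag;
            double w3 = std::exp(-ds / sigmaS2);                  // Eq. 8
            sum += regions[q].numPixels * w3 *                    // omega_2 = region size
                   dissimilarity(regions[p], regions[q]);
        }
        // omega_1, Eq. 7 as printed (BiasToCenter normalized by the diagonal here);
        // one may prefer 1 - bias/D to keep the weight in [0, 1].
        sal[p] = (1.0 - bias[p]) / D * sum;
    }
    return sal;
}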

Figure 4 demonstrates the impact of introducing the spatial distance weight ω_3 and the central bias weight ω_1. We first segment the whole image using the method in [45]; the result is shown on the left of Fig. 4. We then compute the corresponding saliency map with (middle right) and without (middle left) spatial distance and central bias weighting. By incorporating spatial distance and central bias weighting, we obtain a higher quality saliency map that agrees better with the human fixation density map (right).

Fig. 4 The impact of introducing spatial distance and central bias weighting on saliency computation

5 Experimental results

We have evaluated our approach on four public databases provided by Bruce et al. [5], Judd et al. [24], Linde et al. [37] and Achanta et al. [1]. We compared the proposed method with eight state-of-the-art saliency methods, namely RC [8], IT [23], SR [20], FT [1], LC [44], GB [19], CL [21] and IM [5].

Please note that the algorithms compared differ across databases, for the following reason. The authors of [1] pointed out that "the true usefulness of a saliency map is determined by the application". In our experiments, using the Receiver Operating Characteristic (ROC) curve, we found that the area under the curve (AUC) of the FT [1] method was smaller than that of the GB [19] method; a similar observation has been reported by other authors [12]. Therefore, for a fair comparison, we divided the experiments into two groups: we compared our method with RC [8], IT [23], GB [19], CL [21] and IM [5] using the AUC criterion on the Bruce, Judd and Linde databases, and with RC [8], SR [20], FT [1] and LC [44] on the Achanta database using saliency-guided segmentation trials.

We implemented our method in C++. For the other methods, namely GB [19], CL [21], IM [5] and IT [23], we used the authors' implementations. In addition, we downloaded the source code kindly provided by the authors of [8] on their website, which includes implementations of the RC [8], SR [20], FT [1] and LC [44] methods.

5.1 Results on Bruce’s color image database

The first color image database we used was introduced by Bruce et al. [5]. It contains 120 color images, with fixations from 20 subjects recorded for each image (all images are 681 × 511 pixels). To objectively evaluate the performance of the different methods, we use the area under the Receiver Operating Characteristic (ROC) curve [5], i.e. the AUC, to compare the saliency maps with the human fixations.
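For reference, the AUC over fixations can be computed without explicitly sweeping thresholds by using the rank-sum (Mann-Whitney) formulation, as in the C++ sketch below. Treating saliency values at fixated pixels as positives and the values at the remaining pixels as negatives is our assumption about the evaluation setup, and the data layout is hypothetical.

// Sketch of ROC AUC for fixation prediction: for each positive (fixated pixel),
// count how many negatives score below it; ties count as one half.
#include <algorithm>
#include <vector>

double fixationAUC(const std::vector<double>& saliency,   // one value per pixel
                   const std::vector<char>& isFixated) {  // nonzero if fixated
    std::vector<double> pos, neg;
    for (size_t i = 0; i < saliency.size(); ++i)
        (isFixated[i] ? pos : neg).push_back(saliency[i]);
    if (pos.empty() || neg.empty()) return 0.5;            // degenerate case
    std::sort(neg.begin(), neg.end());
    double sum = 0.0;
    for (double s : pos) {
        size_t below = std::lower_bound(neg.begin(), neg.end(), s) - neg.begin();
        size_t notAbove = std::upper_bound(neg.begin(), neg.end(), s) - neg.begin();
        sum += below + 0.5 * (notAbove - below);
    }
    return sum / (static_cast<double>(pos.size()) * neg.size());
}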

Figure 5 shows a visual comparison of the saliency maps obtained by the proposed method and five state-of-the-art methods (IT [23], IM [5], CL [21], GB [19] and RC [8]). The fixation density maps are generated by summing 2D Gaussian approximations of the drop-off in density around the human fixations [5]. From Fig. 5, it is clear that the most salient locations (the brightest regions) of our saliency maps are more consistent with the human fixation density maps. For example, in the second image, the green-yellow melon is attended to by humans, but it is not detected as a salient region by any of the other saliency detection methods except ours.

Fig. 5 Visual comparison of saliency maps between our method and other five approaches on Bruce's color image database. The rows from top to bottom are: the input images, the saliency maps of the IT method [23], IM method [5], CL method [21], GB method [19], RC method [8], our method, and human fixation density maps

Table 1 lists the results of six different methods on Bruce's color image database. The ROC curves are generated using the code of Harel et al. [19]. It is clear that our method outperforms the other five methods in predicting human fixations. Note that the results of the IM, CL and GB methods differ from the corresponding results published in their papers [5, 19, 21], because the sampling density we used to obtain the thresholds is different from theirs. However, in our experiments all methods were evaluated with the same validation approach, so their relative performance should not be affected.

Table 1 Performances on Bruce's color image database

We have also investigated the benefit of the three weighting factors ω_1, ω_2 and ω_3 in our method. Setting ω_1 to 1 decreases the AUC from 0.8943 to 0.8726, and setting ω_2 to 1 decreases the AUC from 0.8943 to 0.8838. Moreover, with flat weights, i.e. ω_1, ω_2 and ω_3 all set to 1, the AUC drops sharply from 0.8943 to 0.7940. Therefore, all three weighting factors play a significant role in computing saliency.

Furthermore, the ROC curves of the six methods on Bruce's color image database are shown in Fig. 6, which demonstrates that our method obtains higher hit rates and lower false positive rates.

Fig. 6 The ROC curves of our method and other five approaches on Bruce's color image database

5.2 Results on Judd’s color image database

We also evaluated our method on Judd's color image database [24]. It contains 1003 natural images covering different scenes and objects, with human fixations recorded from 15 subjects. The image sizes in this database vary: the width ranges from 682 to 1024 pixels and the height from 628 to 1024 pixels. A comparison of the saliency maps produced by the different methods is shown in Fig. 7.

Fig. 7 Visual comparison of saliency maps between our method and other five approaches on Judd's color image database. The rows from top to bottom are: the input images, the saliency maps of the IT method [23], IM method [5], CL method [21], GB method [19], RC method [8], our method, and human fixation density maps

The AUC results are listed in Table 2, and our method achieves the highest AUC. Furthermore, the ROC curves in Fig. 8 demonstrate that our method again obtains higher hit rates and lower false positive rates.

Table 2 Performances on Judd's color image database
Fig. 8 The ROC curves of our method and other five approaches on Judd's color image database

5.3 Results on DOVES gray image database

The DOVES gray image database used in our experiments was introduced by van der Linde et al. [37]. It collects eye movements from 29 human observers viewing 101 calibrated natural images. The database consists of around 30,000 fixation points and is one of the largest eye movement databases made available to the vision research community.

Here, we compare our saliency maps with those of the IT [23], GB [19] and RC [8] methods, as shown in Fig. 9. The IM [5] and CL [21] methods are excluded because they cannot process gray images. For gray images, the parameters keep the same values as before (e.g. the quantization stage still uses 12 levels). The quantitative AUC results are listed in Table 3, and the corresponding ROC curves are shown in Fig. 10. Our method performs relatively poorly on the DOVES gray image database, likely because color images contain semantic objects whereas gray images contain more raw signal. Nevertheless, our method is still superior to the other three approaches on this database.

Fig. 9 Visual comparison of saliency maps between our method and other three approaches on DOVES gray image database. The rows from top to bottom are: the input images, the saliency maps of the IT method [23], GB method [19], RC method [8], our method, and human fixation density maps

Table 3 Performances on the DOVES gray image database
Fig. 10 The ROC curves of our method and other three approaches on DOVES gray image database

5.4 Results on Achanta’s color image database

Finally, we evaluated our method on Achanta's color image database [1]. This database contains 1003 color images, including different objects and scenes. To the best of our knowledge, it is the largest of its kind and provides ground truth in the form of accurate human-marked labels of salient regions. Figure 11 shows saliency maps computed by our approach and four other methods (SR [20], FT [1], LC [44] and RC [8]).

Fig. 11 Visual comparison of saliency maps between our method and other four approaches on Achanta's color image database. The rows from top to bottom are: the input images, the saliency maps of the SR method [20], FT method [1], LC method [44], RC method [8], our method, and the manually labeled ground truth

To quantitatively measure the accuracy of our method on these publicly available benchmark images [1], we conducted a saliency-guided segmentation experiment. As in the fixed-thresholding experiment of [1], we binarized each saliency map at every possible fixed threshold to segment salient regions and compute precision-recall curves [20]. In other words, we vary the threshold from 0 to 255 to reliably compare how well the different saliency detection methods highlight salient regions in images. Figure 12 shows the resulting precision-recall curves of our method and the four other methods; our method again achieves the highest performance.

Fig. 12 Precision-recall curves for naïve thresholding of saliency maps using Achanta's color image database
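The fixed-threshold evaluation can be sketched as follows for a single image; averaging the per-threshold precision and recall over the whole database then yields curves like those in Fig. 12. The data layout and the convention of reporting precision 1 when nothing is selected are our own assumptions.

// Sketch of precision-recall computation by binarizing a saliency map at every
// threshold in [0, 255] and comparing against the binary ground-truth mask.
#include <vector>

struct PR { double precision, recall; };

std::vector<PR> precisionRecallCurve(const std::vector<unsigned char>& saliency,
                                     const std::vector<unsigned char>& groundTruth) {
    std::vector<PR> curve;
    for (int t = 0; t <= 255; ++t) {
        long tp = 0, fp = 0, fn = 0;
        for (size_t i = 0; i < saliency.size(); ++i) {
            bool pred = saliency[i] >= t;        // binarized saliency
            bool gt = groundTruth[i] != 0;       // human-marked salient pixel
            if (pred && gt) ++tp;
            else if (pred && !gt) ++fp;
            else if (!pred && gt) ++fn;
        }
        double precision = (tp + fp) ? static_cast<double>(tp) / (tp + fp) : 1.0;
        double recall = (tp + fn) ? static_cast<double>(tp) / (tp + fn) : 0.0;
        curve.push_back({precision, recall});
    }
    return curve;
}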

5.5 Discussion

In our proposed method, the color saliency boosting function is applied to the original image as a preprocessing step, and we found that it plays an important role in achieving more accurate saliency maps. Table 4 shows the performance of our method on two databases with and without the color saliency boosting function. The performance in Table 4 is evaluated by AUC, and it is clear that color saliency boosting greatly improves our method.

Table 4 Performances on two databases with or without color saliency boosting

To further compare our color saliency boosting function with the method presented in [36], we conducted experiments on two databases using the AUC criterion, as shown in Table 5. Our method obtains clearly better results.

Table 5 Performances on two databases using different color saliency boosting functions

The proposed method is simple and efficient, with no parameters required in its main body. The only parameter, k in Eq. 4, needs to be set when initializing the segmentation; we set it to 50 in all experiments.

Table 6 compares the average time taken by each method on the database of Achanta et al. [1]. Our method is somewhat slower than the SR, FT, LC and RC methods, but it produces saliency maps of superior quality.

Table 6 Average time taken to compute a saliency map for images in the database by Achanta et al. [1]. All methods were tested using a Dual Core 2.4 GHz machine with 2GB RAM

6 Application: saliency based image classification

Image classification is a fundamental problem in computer vision. While steady progress has been made toward this objective, the gap between the capabilities of the primate visual system and state-of-the-art object recognition systems remains vast. In this section, we apply visual attention, an inherent capability of the human brain, to image classification. Figure 13 shows the framework of our saliency based image classification method.

Fig. 13 The framework of saliency based image classification

Our method is similar to the NIMBLE framework [3]; interested readers are referred to [3] for more details. However, most of its components, such as its features and its saliency map model, have been replaced in our method: we use Independent Component Analysis (ICA) features and our proposed color-boosted visual saliency detection model for image classification.

We evaluated our saliency based image classification method on the public Caltech-256 dataset [17], which contains 29,780 images in 256 categories with high intra-class variability and high object location variability. Figure 14 shows classification examples from the Caltech-256 dataset.

Fig. 14 Classification examples from the Caltech-256 dataset

In the classification experiments, we performed 5-fold random cross-validation. In each cross-validation run, n training images were randomly selected for each class, where n was varied, and up to 30 test images (distinct from the training images) were randomly chosen; if fewer than 30 were available, all available images were used. After each cross-validation run, we computed the mean per-class accuracy. We finally report the mean accuracy and compare our results against recent state-of-the-art results [4, 15, 17, 30, 42].

Figure 15 shows our Caltech-256 results compared with other recent methods [4, 15, 17, 30, 42]. Our results compare well with other approaches that use a single feature type, but they are inferior to methods that combine multiple feature types. For instance, Gehler and Nowozin [15] used five feature types to train an SVM and then used boosting to combine the kernels, achieving higher classification accuracy, whereas our method uses a single feature type.

Fig. 15 The performance of our saliency based image classification on the Caltech-256 [17] dataset compared to five state-of-the-art approaches [4, 15, 17, 30, 42]

7 Conclusion and future work

In this paper, we presented a visual saliency detection method that combines the respective merits of color saliency boosting and global region based contrast schemes to compute highly accurate saliency maps. We comprehensively evaluated our method on four publicly available databases and compared it with eight other state-of-the-art methods. The experimental results clearly show that the proposed method is superior in terms of both visual comparison and quantitative analysis. Toward practical application, we also demonstrated how the extracted saliency map can be used for image classification.

There is still room for improvement. For instance, it may be useful to incorporate high-level features, such as human faces and symmetry, into the saliency map. In addition, we plan to develop a more efficient saliency detection algorithm that better handles scenes with cluttered and textured backgrounds. We believe the proposed saliency detection method can be applied to many areas, e.g. efficient object detection and robust image scene analysis.